A computational framework to assess genome-wide distribution of polymorphic human endogenous retrovirus-K in human populations

10.1101/444034 ◽

2018 ◽

Cited By ~ 1

Author(s):

Weiling Li ◽

Lin Lin ◽

Raunaq Malhotra ◽

Lei Yang ◽

Raj Acharya ◽

...

Keyword(s):

Disease Risk ◽

Sequence Data ◽

Endogenous Retrovirus ◽

Genomic Diversity ◽

Human Endogenous Retrovirus ◽

Human Populations ◽

Whole Genome ◽

Short Read ◽

Reference Set ◽

Type K

Abstract: Human Endogenous Retrovirus type K (HERV-K) is the only HERV known to be insertionally polymorphic. It is possible that HERV-Ks contribute to human disease because people differ in both the number and the genomic location of these retroviruses. Indeed, viral transcripts, proteins, and antibodies against HERV-K are detected in cancers and in autoimmune and neurodegenerative diseases. However, attempts to link a polymorphic HERV-K with any disease have been frustrated, in part because the population frequency of the HERV-K provirus at each site is lacking and because it is challenging to identify closely related elements such as HERV-K from short read sequence data. We present an integrated and computationally robust approach that uses whole genome short read data to determine the occupation status at all sites reported to contain a HERV-K provirus. Our method estimates the proportion of fixed-length genomic sequences (k-mers) from whole genome sequence data matching a reference set of k-mers unique to each HERV-K locus and applies mixture model-based clustering to account for low-depth sequence data. Our analysis of 1000 Genomes Project Data (KGP) reveals numerous differences among the five KGP super-populations in the frequency of individual and co-occurring HERV-K proviruses; we provide a visualization tool to easily depict the prevalence of any combination of HERV-K among KGP populations. Further, the genome burden of polymorphic HERV-K is variable in humans, with East Asian (EAS) individuals having the fewest integration sites. Our study identifies population-specific sequence variation for several HERV-K proviruses. We expect these resources will advance research on HERV-K contributions to human diseases.

Author summary: Human Endogenous Retrovirus type K (HERV-K) is the youngest of the retrovirus families in the human genome and is the only group that is polymorphic; a HERV-K can be present in one individual but absent from others. HERV-Ks could contribute to disease risk, but establishing a link between a polymorphic HERV-K and a specific disease has been difficult. We develop an easy-to-use method that reveals the considerable variation existing among global populations in the frequency of individual and co-occurring polymorphic HERV-K, and in the total number of HERV-K that any individual has in their genome. Our study provides a global reference set of HERV-K genomic diversity and the tools needed to determine the genomic landscape of HERV-K in any patient population.
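As a rough illustration of the k-mer step described in the abstract, the following minimal Python sketch computes, for a single locus, the proportion of reference k-mers that are observed in a set of reads. The function names, the choice of k = 5, and the toy sequences are assumptions made for illustration only; this is not the authors' implementation, and it omits the mixture model-based clustering entirely.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matched_proportion(reads, reference_kmers, k):
    """Fraction of a locus's reference k-mers seen in at least one read."""
    if not reference_kmers:
        raise ValueError("reference k-mer set is empty")
    observed = set()
    for read in reads:
        # Keep only read k-mers that belong to the locus reference set.
        observed |= kmers(read, k) & reference_kmers
    return len(observed) / len(reference_kmers)


# Toy data (hypothetical): k-mers assumed to be unique to one locus.
K = 5
locus_kmers = kmers("ACGTTGCAAT", K)
reads = ["ACGTTGC", "GGGGGGG"]
proportion = matched_proportion(reads, locus_kmers, K)
```

In a real analysis the per-locus proportions for many individuals would then be clustered; the abstract states that a mixture model is used for that step to cope with low sequencing depth.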

New whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica

10.1101/003764 ◽

2014 ◽

Cited By ~ 4

Author(s):

Michael C Schatz ◽

Lyza G Maron ◽

Joshua C Stein ◽

Alejandro Hernandez Wences ◽

James Gurtowski ◽

...

Keyword(s):

Structural Variation ◽

De Novo ◽

Sequence Data ◽

Biological Properties ◽

Genomic Diversity ◽

Reference Sequence ◽

Human Populations ◽

Whole Genome ◽

Rice Varieties ◽

Assembly Technology

The use of high-throughput genome-sequencing technologies has uncovered a large extent of structural variation in eukaryotic genomes that makes important contributions to genomic diversity and phenotypic variation. Currently, when the genomes of different strains of a given organism are compared, whole genome resequencing data are aligned to an established reference sequence. However, when the reference differs in significant structural ways from the individuals under study, the analysis is often incomplete or inaccurate. Here, we use rice as a model to explore the extent of structural variation among strains adapted to different ecologies and geographies, and show that this variation can be significant, often matching or exceeding the variation present in closely related human populations or other mammals. We demonstrate how improvements in sequencing and assembly technology allow rapid and inexpensive de novo assembly of next generation sequence data into high-quality assemblies that can be directly compared to provide an unbiased assessment. Using this approach, we are able to accurately assess the "pan-genome" of three divergent rice varieties and document several megabases of each genome absent in the other two. Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard resequencing approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species.

Expression of human endogenous retrovirus type K envelope glycoprotein in insect and mammalian cells.

Journal of Virology ◽

10.1128/jvi.71.4.2747-2756.1997 ◽

1997 ◽

Vol 71 (4) ◽

pp. 2747-2756 ◽

Cited By ~ 18

Author(s):

R R Tönjes ◽

C Limbach ◽

R Löwer ◽

R Kurth

Keyword(s):

Envelope Glycoprotein ◽

Mammalian Cells ◽

Endogenous Retrovirus ◽

Human Endogenous Retrovirus ◽

Type K

Identification and Characterization of Novel Human Endogenous Retrovirus Families by Phylogenetic Screening of the Human Genome Mapping Project Database

Journal of Virology ◽

10.1128/jvi.74.8.3715-3730.2000 ◽

2000 ◽

Vol 74 (8) ◽

pp. 3715-3730 ◽

Cited By ~ 202

Author(s):

Michael Tristem

Keyword(s):

Human Genome ◽

Genome Mapping ◽

Sequence Data ◽

Endogenous Retrovirus ◽

Endogenous Retroviruses ◽

Human Endogenous Retrovirus ◽

Sequence Information ◽

Class Iii ◽

Genome Mapping Project ◽

Human Genome Mapping Project

Abstract: Human endogenous retroviruses (HERVs) were first identified almost 20 years ago, and since then numerous families have been described. It has, however, been difficult to obtain a good estimate of both the total number of independently derived families and their relationship to each other as well as to other members of the family Retroviridae. In this study, I used sequence data derived from over 150 novel HERVs, obtained from the Human Genome Mapping Project database, and a variety of recently identified nonhuman retroviruses to classify the HERVs into 22 independently acquired families. Of these, 17 families were loosely assigned to the class I HERVs, 3 to the class II HERVs, and 2 to the class III HERVs. Many of these families have been identified previously, but six are described here for the first time and another four, for which only partial sequence information was previously available, were further characterized. Members of each of the 10 families are defective, and calculation of their integration dates suggested that most of them are likely to have been present within the human lineage since it diverged from the Old World monkeys more than 25 million years ago.

Abstract 1257: Human endogenous retrovirus type K (HERV-K) env protein as a vaccine target for HERV-K+ cancer prevention

10.1158/1538-7445.am2018-1257 ◽

2018 ◽

Author(s):

Feng Wang-Johanning ◽

Jia Li ◽

Ming Li ◽

Gary L. Johanning ◽

Albert Lee ◽

...

Keyword(s):

Cancer Prevention ◽

Endogenous Retrovirus ◽

Human Endogenous Retrovirus ◽

Vaccine Target ◽

Type K

Characterization of Human Endogenous Retrovirus Type K Virus-like Particles Generated from Recombinant Baculoviruses

Virology ◽

10.1006/viro.1997.8614 ◽

1997 ◽

Vol 233 (2) ◽

pp. 280-291 ◽

Cited By ~ 24

Author(s):

Ralf R. Tönjes ◽

Klaus Boller ◽

Christiane Limbach ◽

Raimond Lugert ◽

Reinhard Kurth

Keyword(s):

Endogenous Retrovirus ◽

Recombinant Baculoviruses ◽

Human Endogenous Retrovirus ◽

Virus Like Particles ◽

Type K

Expression of Human Endogenous Retrovirus Type K Envelope Protein is a Novel Candidate Prognostic Marker for Human Breast Cancer

Genes & Cancer ◽

10.1177/1947601911431841 ◽

2011 ◽

Vol 2 (9) ◽

pp. 914-922 ◽

Cited By ~ 40

Author(s):

J. Zhao ◽

K. Rycaj ◽

S. Geng ◽

M. Li ◽

J. B. Plummer ◽

...

Keyword(s):

Breast Cancer ◽

Prognostic Marker ◽

Envelope Protein ◽

Human Breast Cancer ◽

Endogenous Retrovirus ◽

Human Endogenous Retrovirus ◽

Human Breast ◽

Type K

Identification of meiotic recombination through gamete genome reconstruction using whole genome linked-reads

10.1101/363341 ◽

2018 ◽

Author(s):

Peng Xu ◽

Zechen Chong ◽

Keyword(s):

Meiotic Recombination ◽

Haplotype Diversity ◽

Genomic Analysis ◽

Genomic Diversity ◽

Pedigree Information ◽

Human Populations ◽

Whole Genome ◽

Homologous Chromosomes ◽

Template Strand ◽

Recombination Hotspots

Abstract: Meiotic recombination (MR), which transmits exchanged genetic material between homologous chromosomes to offspring, plays a crucial role in shaping genomic diversity in eukaryotic organisms. In humans, thousands of meiotic recombination hotspots have been mapped by population genetics approaches. However, direct identification of MR events for individuals is still challenging due to the difficulty of resolving the haplotypes of homologous chromosomes and reconstructing the gamete genome. Whole genome linked-read sequencing (lrWGS) can generate haplotype sequences of mega-base pairs (N50 ~2.5 Mb) after computational phasing. However, the haplotype information is still isolated in a large number of fragmented genomic regions and limited by switch errors, impeding its further application in chromosome-scale analysis. In this study, we developed a tool, MRLR (Meiotic Recombination identification by Linked-Read sequencing), for the analysis of individual MR events. By leveraging trio pedigree information with lrWGS haplotypes, our pipeline is sufficient to reconstruct the whole human gamete genome with 99.8% haplotyping accuracy. By analyzing the haplotype exchange between homologous chromosomes, MRLR identified 462 high-resolution MR events in 6 human trio samples from the Genome In A Bottle (GIAB) and the Human Genome Structural Variation Consortium (HGSVC). In three datasets of the HGSVC, our results recapitulated 149 (92%) previously identified high-confidence MR events and discovered 85 novel events. About half (40) of the new events are supported by single-cell template strand sequencing (Strand-seq) results. We found that 332 (71.9%) MR events co-localize with recombination hotspots (>10 cM/Mb) in human populations, and MR breakpoint regions are enriched in PRDM9 and DMC1 binding sites. In addition, 48% (221) of the breakpoint regions were detected inside a gene, indicating that these MRs can directly affect the haplotype diversity of genic regions. Taken together, our approach provides new opportunities in the haplotype-based genomic analysis of individual meiotic recombination. The MRLR software is implemented in Perl and is freely available at https://github.com/ChongLab/MRLR.
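The idea of locating recombination by analyzing haplotype exchange can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each informative variant along a transmitted chromosome has already been labeled with the parental haplotype it came from ("A" or "B"); a change of label between neighbouring variants marks a candidate breakpoint interval. The function name and toy data are invented for illustration, and this is not the MRLR algorithm (which is implemented in Perl). In particular, this naive version would also report phasing switch errors as events.

```python
def find_switches(origin_labels, positions):
    """Return (left, right) intervals where the haplotype of origin changes.

    origin_labels: parental haplotype label per informative variant.
    positions: genomic coordinate of each variant, in the same order.
    """
    if len(origin_labels) != len(positions):
        raise ValueError("labels and positions must have the same length")
    events = []
    for i in range(1, len(origin_labels)):
        if origin_labels[i] != origin_labels[i - 1]:
            # The breakpoint lies somewhere between these two variants.
            events.append((positions[i - 1], positions[i]))
    return events


# Toy data (hypothetical): six variants along one transmitted chromosome.
labels = ["A", "A", "A", "B", "B", "A"]
positions = [100, 250, 900, 1400, 2000, 2600]
events = find_switches(labels, positions)
```

The resolution of each candidate event is bounded by the spacing of the flanking informative variants, which is one reason dense, well-phased haplotypes matter for this kind of analysis.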

Detection of long repeat expansions from PCR-free whole-genome sequence data

10.1101/093831 ◽

2016 ◽

Cited By ~ 3

Author(s):

Egor Dolzhenko ◽

Joke J.F.A. van Vugt ◽

Richard J. Shaw ◽

Mitchell A. Bekritsky ◽

Marka van Blitterswijk ◽

...

Keyword(s):

Fragile X Syndrome ◽

Sequence Data ◽

Fragile X ◽

Software Tool ◽

Whole Genome Sequence ◽

Read Length ◽

Whole Genome ◽

Wild Type ◽

Short Read ◽

Repeat Expansions

Abstract: Identifying large repeat expansions such as those that cause amyotrophic lateral sclerosis (ALS) and Fragile X syndrome is challenging for short-read (100-150 bp) whole genome sequencing (WGS) data. A solution to this problem is an important step towards integrating WGS into precision medicine. We have developed a software tool called ExpansionHunter that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length. We applied our algorithm to WGS data from 3,001 ALS patients who have been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Taking the RP-PCR calls as the ground truth, our WGS-based method identified pathogenic repeat expansions with 98.1% sensitivity and 99.7% specificity. Further inspection identified that all 11 conflicts were resolved as errors in the original RP-PCR results. Compared against this updated result, ExpansionHunter correctly classified all (212/212) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2,786/2,789) of the wild type samples were correctly classified as wild type by this method, with the remaining three identified as possible expansions. We further applied our algorithm to a set of 144 samples where every sample had one of eight different pathogenic repeat expansions, including examples associated with fragile X syndrome, Friedreich's ataxia, and Huntington's disease, and correctly flagged all of the known repeat expansions. Finally, we tested the accuracy of our method for short repeats by comparing our genotypes with results from 860 samples sized using fragment length analysis and determined that our calls were >95% accurate. ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.

A benchmarking of human mitochondrial DNA haplogroup classifiers from whole-genome and whole-exome sequence data

10.1101/2021.02.11.430775 ◽

2021 ◽

Author(s):

Víctor García-Olivares ◽

Adrián Muñoz-Barrera ◽

José Miguel Lorenzo-Salazar ◽

Carlos Zaragoza-Trello ◽

Luis A. Rubio-Rodríguez ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Data ◽

Qualitative Assessment ◽

Whole Genome ◽

Third Generation ◽

Sequencing Data ◽

Short Read ◽

Bioinformatic Tools ◽

Whole Exome

Abstract: The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. In addition, because of its relevance, we assess the best performing tool on third-generation long noisy read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (based on either a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct mtDNA analyses from HTS data.

First phylogenetic analysis of Malian SARS-CoV-2 sequences provide molecular insights into the genomic diversity of the Sahel region

10.1101/2020.09.23.20165639 ◽

2020 ◽

Author(s):

Bourema Kouriba ◽

Angela Duerr ◽

Alexandra Rehn ◽

Abdoul Karim Sangare ◽

Brehima Youssouf Traoure ◽

...

Keyword(s):

Phylogenetic Analysis ◽

Genome Sequencing ◽

Sequence Data ◽

Genomic Diversity ◽

Whole Genome ◽

Sequencing Data ◽

Genome Sequences ◽

Spreading Dynamics ◽

Sahel Region ◽

Limited Sequence

We are currently facing a pandemic of COVID-19, caused by a spillover from an animal-originating coronavirus to humans occurring in the Wuhan region, China, in December 2019. From China the virus has spread to 188 countries and regions worldwide, reaching the Sahel region on the 2nd of March 2020. Since whole genome sequencing (WGS) data are crucial for understanding the spreading dynamics of the ongoing pandemic, but only limited sequence data are available from the Sahel region to date, we have focused our efforts on generating the first Malian sequencing data available. Screening of 217 Malian patient samples for the presence of SARS-CoV-2 resulted in 38 positive isolates, from which 21 whole genome sequences were generated. Our analysis shows that both the early A (19B) clade and the fast-evolving B (20A/C) clade are present in Mali, indicating multiple and independent introductions of SARS-CoV-2 to the Sahel region.
