Improved genome inference in the MHC using a population reference graph

2014 ◽  
Author(s):  
Alexander Dilthey ◽  
Charles Cox ◽  
Zamin Iqbal ◽  
Matthew R. Nelson ◽  
Gil McVean

In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5 Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and long-read data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.
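The idea of reconstructing a sample as a path that may recombine between known haplotypes can be illustrated with a much simpler model than the one the abstract describes. The sketch below is not the authors' graph HMM: it assumes pre-aligned, equal-length haplotypes, one hidden state per haplotype, and two hypothetical parameters (`error_rate`, `switch_rate`) chosen only for illustration. It uses the Viterbi algorithm to find the most likely sequence of haplotypes copied at each site.

```python
import math
from typing import List, Sequence


def viterbi_haplotype_path(
    observed: str,
    haplotypes: Sequence[str],
    error_rate: float = 0.01,   # assumed per-site mismatch probability
    switch_rate: float = 0.05,  # assumed per-site recombination probability
) -> List[int]:
    """Return the most likely haplotype index copied at each site."""
    n_states = len(haplotypes)
    n_sites = len(observed)
    if n_states < 2 or n_sites == 0:
        raise ValueError("need at least two haplotypes and one site")
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the observed length")

    log_match = math.log(1.0 - error_rate)
    log_mismatch = math.log(error_rate)
    log_stay = math.log(1.0 - switch_rate)
    log_switch = math.log(switch_rate / (n_states - 1))

    def emit(state: int, site: int) -> float:
        same = haplotypes[state][site] == observed[site]
        return log_match if same else log_mismatch

    # Uniform prior over the starting haplotype.
    scores = [math.log(1.0 / n_states) + emit(s, 0) for s in range(n_states)]
    backpointers: List[List[int]] = []

    for site in range(1, n_sites):
        new_scores = []
        pointers = []
        for state in range(n_states):
            best_prev, best_score = 0, -math.inf
            for prev in range(n_states):
                step = log_stay if prev == state else log_switch
                candidate = scores[prev] + step
                if candidate > best_score:
                    best_prev, best_score = prev, candidate
            new_scores.append(best_score + emit(state, site))
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path
```

For example, with haplotypes `"AAAAAA"` and `"CCCCCC"` and an observed sequence `"AAACCC"`, a single switch is far cheaper than three mismatches under these parameters, so the path copies the first haplotype for three sites and then the second.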

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Juan Carlos Muñoz-Escalante ◽  
Andreu Comas-García ◽  
Sofía Bernal-Silva ◽  
Carla Daniela Robles-Espinoza ◽  
Guillermo Gómez-Leal ◽  
...  

Abstract Respiratory syncytial virus (RSV), a leading cause of lower respiratory tract infections, is classified in two major groups (A and B) with multiple genotypes within them. Continuous changes in spatiotemporal distribution of RSV genotypes have been recorded since the identification of this virus. However, there are no established criteria for genotype definition, which affects the understanding of viral evolution, immunity, and development of vaccines. We conducted a phylogenetic analysis of 4,353 RSV-A G gene ectodomain sequences, and used 1,103 complete genome sequences to analyze the totality of RSV-A genes. Intra- and intergenotype p-distance analysis and identification of molecular markers associated with specific genotypes were performed. Our results indicate that previously reported genotypes can be classified into nine distinct genotypes: GA1-GA7, SAA1, and NA1. We propose the analysis of the G gene ectodomain with a wide set of reference sequences of all genotypes for an accurate genotype identification.
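The p-distance mentioned above is the proportion of aligned sites at which two sequences differ. The following is a minimal sketch of intra- and intergenotype mean p-distances, assuming pre-aligned sequences of equal length and assuming that gaps (`-`) and ambiguous bases (`N`) are skipped; the function names and the gap handling are illustrative choices, not the procedure used in the study.

```python
from itertools import combinations, product
from typing import Sequence


def p_distance(seq_a: str, seq_b: str, skip_chars: str = "-N") -> float:
    """Proportion of differing sites among positions where both have a base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue  # assumed convention: ignore gapped or ambiguous sites
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_intra_distance(group: Sequence[str]) -> float:
    """Mean p-distance over all pairs within one genotype."""
    pairs = list(combinations(group, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def mean_inter_distance(group_a: Sequence[str], group_b: Sequence[str]) -> float:
    """Mean p-distance over all pairs drawn from two different genotypes."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)
```

A genotype definition based on such values would compare the within-group mean against the between-group mean; any threshold for that comparison would have to come from the study itself.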


2015 ◽  
Vol 781 ◽  
pp. 637-640
Author(s):  
Thitiwat Piyatamrong ◽  
Anan Kamolphanus ◽  
Gasydech Lergchinnaboot ◽  
Krittin Suphakarn ◽  
Chivalai Temiyasathit

Dengue, caused by Dengue virus (DENV), is one of the most widespread infectious diseases in the world, especially in South East Asia, and is transmitted by mosquitoes. Virus sequences are assembled as long series of nucleic acids, which makes the task of diagnosing virus sequences burdensome. Graphical representations have therefore been proposed to support studies of Dengue virus sequence diagnosis. However, graphically representing sequences remains a difficult task, especially for incomplete genome sequences, because of the missing nucleic acids. Although a number of studies provide methodologies for virus sequence visualization, in Dengue virus research those methodologies visualize only complete genome sequences and neglect incomplete ones. Given this limited availability of research inputs, our study proposes a methodology for graphically representing incomplete as well as complete Dengue virus sequences by imputing the missing part of a sequence from created reference sequences. The proposed methodology uses database technology and a majority voting technique to create a reference sequence for each Dengue serotype. Experimental results show that incomplete sequences are visualized realistically according to their respective serotypes, giving Dengue virus research the flexibility to use incomplete sequences as inputs.
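The majority-voting and imputation steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the sequences of one serotype are already aligned to the same length, that missing positions are marked with a placeholder character (`N` here, a hypothetical choice), and it leaves out the database layer entirely.

```python
from collections import Counter
from typing import Sequence

MISSING = "N"  # assumed placeholder for a missing nucleotide


def majority_reference(aligned_sequences: Sequence[str]) -> str:
    """Build a reference by taking the most common base in each column."""
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")

    reference = []
    for column in zip(*aligned_sequences):
        # Missing positions do not get a vote.
        votes = Counter(base for base in column if base != MISSING)
        reference.append(votes.most_common(1)[0][0] if votes else MISSING)
    return "".join(reference)


def impute(sequence: str, reference: str) -> str:
    """Fill missing positions of a sequence from the serotype reference."""
    if len(sequence) != len(reference):
        raise ValueError("sequence and reference must have the same length")
    return "".join(
        ref_base if base == MISSING else base
        for base, ref_base in zip(sequence, reference)
    )
```

In this sketch one reference would be built per serotype, and an incomplete sequence would be imputed from the reference of its own serotype before being passed to whatever visualization is used.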


2017 ◽  
Vol 5 (20) ◽  
Author(s):  
Eduardo Castro-Nallar ◽  
Sandro L. Valenzuela ◽  
Sebastián Baquedano ◽  
Carolina Sánchez ◽  
Fabiola Fernández ◽  
...  

ABSTRACT We present draft genome sequences of five Enterococcus species from patients suspected of Clostridium difficile infection. Genome completeness was confirmed by presence of bacterial orthologs (97%). Gene searches using hidden Markov models revealed that the isolates harbor between seven and 11 genes involved in antibiotic resistance to tetracyclines, beta-lactams, and vancomycin.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12707
Author(s):  
Girum Fitihamlak Ejigu ◽  
Gangman Yi ◽  
Jong Im Kim ◽  
Jaehee Jung

The massively parallel nature of next-generation sequencing technologies has contributed to the generation of massive sequence data in the last two decades. Deciphering the meaning of each generated sequence requires multiple analysis tools, at all stages of analysis, from the reads stage all the way up to the whole-genome level. Homology-based approaches based on related reference sequences are usually the preferred option for gene and transcript prediction in newly sequenced genomes, resulting in the popularity of a variety of BLAST and BLAST-based tools. For organelle genomes, a single-reference–based gene finding tool that uses grouping parameters for BLAST results has been implemented in the Genome Search Plotter (GSP). However, this tool does not accept multiple and user-customized reference sequences required for a broad homology search. Here, we present multiple Reference–based Gene Search and Plot (ReGSP), a simple and convenient web tool that accepts multiple reference sequences for homology-based gene search. The tool incorporates cPlot, a novel dot plot tool, for illustrating nucleotide sequence similarity between the query and the reference sequences. ReGSP has an easy-to-use web interface and is freely accessible at https://ds.mju.ac.kr/regsp.


2015 ◽  
Vol 3 (6) ◽  
Author(s):  
Craig M. Stephens ◽  
Jeffrey M. Skerker ◽  
Manraj S. Sekhon ◽  
Adam P. Arkin ◽  
Lee W. Riley

Finished genome sequences are presented for four Escherichia coli strains isolated from bloodstream infections at San Francisco General Hospital. These strains provide reference sequences for four major fimH-identified sublineages within the multilocus sequence type (MLST) ST95 group, and provide insights into pathogenicity and differential antimicrobial susceptibility within this group.


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i12-i20 ◽  
Author(s):  
Vitor C Piro ◽  
Temesgen H Dadi ◽  
Enrico Seiler ◽  
Knut Reinert ◽  
Bernhard Y Renard

Abstract. Motivation: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing number of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large numbers of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use indices that are often outdated and almost never truly up-to-date. Results: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability and implementation: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. Supplementary information: Supplementary data are available at Bioinformatics online.
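To make the general idea of k-mer-based read classification with Bloom filters concrete, here is a deliberately simplified sketch. It is not ganon's code or API and does not implement an Interleaved Bloom Filter: it assumes one ordinary Bloom filter per reference group, and the class, function names, filter size, hash scheme, and `min_fraction` threshold are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, Iterator, List, Optional


class BloomFilter:
    """Plain Bloom filter over strings (may report false positives)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash positions by prefixing a seed to the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(sequence: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(references: Dict[str, str], k: int) -> Dict[str, BloomFilter]:
    """One filter per reference group, holding that group's k-mers."""
    index = {}
    for name, sequence in references.items():
        bloom = BloomFilter()
        for kmer in kmers(sequence, k):
            bloom.add(kmer)
        index[name] = bloom
    return index


def classify(read: str, index: Dict[str, BloomFilter], k: int,
             min_fraction: float = 0.5) -> Optional[str]:
    """Assign the read to the group sharing the most k-mers, if enough match."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return None
    best_name, best_count = None, 0
    for name, bloom in index.items():
        count = sum(1 for kmer in read_kmers if kmer in bloom)
        if count > best_count:
            best_name, best_count = name, count
    if best_count / len(read_kmers) >= min_fraction:
        return best_name
    return None
```

Keeping one filter per group means a group can be rebuilt or added without touching the others, which is one simple way to think about keeping an index current; how ganon actually achieves its update performance is described in the paper, not here.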


2021 ◽  
Author(s):  
Danang Crysnanto ◽  
Alexander S. Leonard ◽  
Zih-Hua Fang ◽  
Hubert Pausch

Linear reference genomes are typically assembled from single individuals. They are unable to reflect the genetic diversity of populations and lack millions of bases. To overcome such limitations and make non-reference sequences amenable to genetic investigations, we build a multi-assembly graph from six reference-quality assemblies from taurine cattle and their close relatives. We uncover 70,329,827 bases that are missing in the bovine linear reference genome. The missing sequences encode novel transcripts that are differentially expressed between individual animals. Reads that previously mapped poorly or not at all to the bovine reference genome now align accurately to the non-reference sequences. We show that the non-reference sequences contain polymorphic sites that segregate within and between breeds of cattle. Our efforts to uncover novel functional sequences from a multi-assembly graph pave the way towards the transition to a more representative bovine reference genome.


2019 ◽  
Author(s):  
Jiyun M. Moon ◽  
John A. Capra ◽  
Patrick Abbot ◽  
Antonis Rokas

Abstract Evolutionary changes in enhancers are widely associated with variation in human traits and diseases. However, studies comprehensively quantifying levels of selection on enhancers at multiple evolutionary time points during recent human evolution and how enhancer evolution varies across human tissues are lacking. To address these questions, we integrated a dataset of 41,561 transcribed enhancers active in 41 different human tissues (FANTOM Consortium) with whole genome sequences of 1,668 individuals from the African, Asian, and European populations (1000 Genomes Project). Our analyses based on four different metrics (Tajima’s D, FST, H12, nSL) showed that ~5.90% of enhancers considered showed evidence of recent positive selection and that genes associated with enhancers under positive selection are enriched for diverse immune-related functions. The distributions of these metrics for brain and testis enhancers were often statistically significantly different compared to those of other tissues; the same was true for brain and testis enhancers that are tissue-specific compared to those that are tissue-broad and for testis enhancers associated with tissue-enriched and non-tissue-enriched genes. These differences varied considerably across metrics and tissues and were generally due to changes in distributions’ shapes rather than shifts in their values. These results suggest that many human enhancers experienced recent positive selection throughout multiple time periods in human evolutionary history, that this selection occurred in a tissue-dependent and immune-related functional context, and that much like the evolution of their coding counterparts, the evolution of brain and testis enhancers has been markedly different from that of enhancers in other tissues.

