Improved genome inference in the MHC using a population reference graph

Mapping Intimacies ◽

10.1101/006973 ◽

2014 ◽

Cited By ~ 3

Author(s):

Alexander Dilthey ◽

Charles Cox ◽

Zamin Iqbal ◽

Matthew R. Nelson ◽

Gil McVean

Keyword(s):

Hidden Markov ◽

Structural Diversity ◽

Chromosome 6 ◽

Genome Sequences ◽

1000 Genomes ◽

Reference Quality ◽

Reference Sequences ◽

Multiple Reference ◽

Short Indels ◽

Reference Graph

In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5 Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and long-read data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.
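The idea of reconstructing a sample as a path that may switch between haplotypes can be illustrated with a toy Viterbi decoder. This is only a minimal sketch of a generic haplotype-copying hidden Markov model, not the authors' population reference graph method: the function name `viterbi_copying_path`, the uniform switch probability, and the single error probability are all assumptions made for illustration.

```python
import math


def viterbi_copying_path(haplotypes, observed, switch_prob=0.01, error_prob=0.01):
    """Return, per position, the index of the haplotype most likely being copied.

    Hypothetical toy model: hidden states are haplotype indices, a switch to any
    other haplotype has total probability switch_prob (a stand-in for
    recombination), and the observed base mismatches the copied base with
    probability error_prob. Both probabilities must lie strictly between 0 and 1.
    """
    n = len(haplotypes)
    stay = math.log(1 - switch_prob)
    switch = math.log(switch_prob / (n - 1)) if n > 1 else float("-inf")

    def emit(h, pos):
        match = haplotypes[h][pos] == observed[pos]
        return math.log(1 - error_prob) if match else math.log(error_prob)

    # Uniform prior over haplotypes at the first position.
    score = [emit(h, 0) - math.log(n) for h in range(n)]
    back = []
    for pos in range(1, len(observed)):
        new_score, pointers = [], []
        for h in range(n):
            # Best predecessor: stay on h, or switch in from another haplotype.
            prev = max(range(n), key=lambda k: score[k] + (stay if k == h else switch))
            transition = stay if prev == h else switch
            new_score.append(score[prev] + transition + emit(h, pos))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

For example, with haplotypes `"AAAAAAAA"` and `"CCCCCCCC"` and the observation `"AAAACCCC"`, the decoded path copies the first haplotype for four positions and then switches to the second, because one switch is cheaper than four mismatches.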


Respiratory syncytial virus A genotype classification based on systematic intergenotypic and intragenotypic sequence analysis

Scientific Reports ◽

10.1038/s41598-019-56552-2 ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Juan Carlos Muñoz-Escalante ◽

Andreu Comas-García ◽

Sofía Bernal-Silva ◽

Carla Daniela Robles-Espinoza ◽

Guillermo Gómez-Leal ◽

...

Keyword(s):

Respiratory Syncytial Virus ◽

Respiratory Tract Infections ◽

Viral Evolution ◽

Lower Respiratory Tract Infections ◽

Genome Sequences ◽

Distance Analysis ◽

Genotype Identification ◽

Reference Sequences ◽

Syncytial Virus ◽

Tract Infections

Respiratory syncytial virus (RSV), a leading cause of lower respiratory tract infections, is classified into two major groups (A and B) with multiple genotypes within them. Continuous changes in the spatiotemporal distribution of RSV genotypes have been recorded since the identification of this virus. However, there are no established criteria for genotype definition, which affects the understanding of viral evolution, immunity, and development of vaccines. We conducted a phylogenetic analysis of 4,353 RSV-A G gene ectodomain sequences, and used 1,103 complete genome sequences to analyze the totality of RSV-A genes. Intra- and intergenotype p-distance analysis and identification of molecular markers associated with specific genotypes were performed. Our results indicate that previously reported genotypes can be classified into nine distinct genotypes: GA1-GA7, SAA1, and NA1. We propose the analysis of the G gene ectodomain with a wide set of reference sequences of all genotypes for an accurate genotype identification.
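The p-distance mentioned above is the proportion of aligned sites at which two sequences differ. A minimal sketch follows; the function name `p_distance` and the choice to skip sites where either sequence has a gap or `N` (pairwise deletion) are assumptions for illustration, and the paper may have handled such sites differently.

```python
def p_distance(seq_a, seq_b, skip_chars="-N"):
    """Proportion of differing sites between two aligned sequences.

    Assumption: sites where either sequence carries a gap or an ambiguous
    base (any character in skip_chars) are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared
```

Intragenotype distances would then be the values of `p_distance` for pairs of sequences within one genotype, and intergenotype distances the values for pairs drawn from different genotypes.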


Incomplete Nucleic Acid Sequences Visualization: A Case Study in Virus Sequences

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.781.637 ◽

2015 ◽

Vol 781 ◽

pp. 637-640

Author(s):

Thitiwat Piyatamrong ◽

Anan Kamolphanus ◽

Gasydech Lergchinnaboot ◽

Krittin Suphakarn ◽

Chivalai Temiyasathit

Keyword(s):

Nucleic Acid ◽

Dengue Virus ◽

East Asian ◽

Majority Voting ◽

Genome Sequences ◽

The World ◽

Database Technology ◽

Virus Sequence ◽

Reference Sequences

Dengue virus (DENV) causes one of the most widespread infectious diseases in the world, especially in South East Asia. Dengue is a viral disease transmitted by mosquitoes. The virus sequences are assembled as series of nucleic acids, which makes the task of diagnosing virus sequences burdensome. Graphical representations have therefore been proposed to support studies of Dengue virus sequence diagnosis. However, graphically representing sequences remains a difficult task, especially for incomplete genome sequences, because of the missing nucleic acids. Although a number of studies provide methodologies for virus sequence visualization, in Dengue virus research those methodologies visualize only complete genome sequences and neglect incomplete ones. Given this limit on usable research inputs, our study proposes a methodology for graphically representing incomplete as well as complete Dengue virus sequences by imputing the incomplete part of a sequence from created reference sequences. The proposed methodology uses database technology and a majority voting technique to create reference sequences for each serotype of Dengue. Experimental results show that incomplete sequences are visualized realistically according to their respective serotypes, giving Dengue virus research the flexibility to use incomplete sequences as inputs.
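The majority-voting step described above can be sketched as a per-column vote over aligned sequences of one serotype, followed by filling the missing positions of an incomplete sequence from the resulting reference. This is an illustrative sketch only: the function names, the use of `N` and `-` as missing markers, and the assumption that the sequences are already aligned to equal length are not taken from the paper.

```python
from collections import Counter


def majority_reference(aligned_sequences, missing="N-"):
    """Build a reference by taking the most common base in each column.

    Assumption: all sequences belong to one serotype and are aligned to the
    same length; columns with no observed base yield "N".
    """
    reference = []
    for column in zip(*aligned_sequences):
        votes = Counter(base for base in column if base not in missing)
        reference.append(votes.most_common(1)[0][0] if votes else "N")
    return "".join(reference)


def impute(sequence, reference, missing="N-"):
    """Replace missing positions in a sequence with the reference base."""
    return "".join(r if s in missing else s for s, r in zip(sequence, reference))
```

A possible use is `impute("ANNT", majority_reference(["ACGT", "ACGA", "ANGT"]))`, which fills the two missing positions from the voted reference.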


Compressing population DNA sequences using multiple reference sequences

2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) ◽

10.1109/apsipa.2017.8282136 ◽

2017 ◽

Author(s):

Kin-On Cheng ◽

Ngai-Fong Law ◽

Wan-Chi Siu

Keyword(s):

Dna Sequences ◽

Reference Sequences ◽

Multiple Reference


Draft Genome Sequences of Five Enterococcus Species Isolated from the Gut of Patients with Suspected Clostridium difficile Infection

Genome Announcements ◽

10.1128/genomea.00379-17 ◽

2017 ◽

Vol 5 (20) ◽

Author(s):

Eduardo Castro-Nallar ◽

Sandro L. Valenzuela ◽

Sebastián Baquedano ◽

Carolina Sánchez ◽

Fabiola Fernández ◽

...

Keyword(s):

Antibiotic Resistance ◽

Clostridium Difficile ◽

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Draft Genome ◽

Enterococcus Species ◽

Beta Lactams ◽

Genome Sequences ◽

Content Type

We present draft genome sequences of five Enterococcus species from patients suspected of Clostridium difficile infection. Genome completeness was confirmed by presence of bacterial orthologs (97%). Gene searches using hidden Markov models revealed that the isolates harbor between seven and 11 genes involved in antibiotic resistance to tetracyclines, beta-lactams, and vancomycin.


Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure

Journal of Molecular Biology ◽

10.1006/jmbi.2001.5080 ◽

2001 ◽

Vol 313 (4) ◽

pp. 903-919 ◽

Cited By ~ 732

Author(s):

Julian Gough ◽

Kevin Karplus ◽

Richard Hughey ◽

Cyrus Chothia

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Genome Sequences


ReGSP: a visualized application for homology-based gene searching and plotting using multiple reference sequences

PeerJ ◽

10.7717/peerj.12707 ◽

2021 ◽

Vol 9 ◽

pp. e12707

Author(s):

Girum Fitihamlak Ejigu ◽

Gangman Yi ◽

Jong Im Kim ◽

Jaehee Jung

Keyword(s):

Sequence Data ◽

Sequence Similarity ◽

Homology Search ◽

Sequencing Technologies ◽

Gene Search ◽

Organelle Genomes ◽

Genome Search ◽

Reference Sequences ◽

Genome Level ◽

Multiple Reference

The massively parallel nature of next-generation sequencing technologies has contributed to the generation of massive sequence data in the last two decades. Deciphering the meaning of each generated sequence requires multiple analysis tools, at all stages of analysis, from the reads stage all the way up to the whole-genome level. Homology-based approaches based on related reference sequences are usually the preferred option for gene and transcript prediction in newly sequenced genomes, resulting in the popularity of a variety of BLAST and BLAST-based tools. For organelle genomes, a single-reference–based gene finding tool that uses grouping parameters for BLAST results has been implemented in the Genome Search Plotter (GSP). However, this tool does not accept multiple and user-customized reference sequences required for a broad homology search. Here, we present multiple Reference–based Gene Search and Plot (ReGSP), a simple and convenient web tool that accepts multiple reference sequences for homology-based gene search. The tool incorporates cPlot, a novel dot plot tool, for illustrating nucleotide sequence similarity between the query and the reference sequences. ReGSP has an easy-to-use web interface and is freely accessible at https://ds.mju.ac.kr/regsp.


Complete Genome Sequences of Four Escherichia coli ST95 Isolates from Bloodstream Infections

Genome Announcements ◽

10.1128/genomea.01241-15 ◽

2015 ◽

Vol 3 (6) ◽

Cited By ~ 10

Author(s):

Craig M. Stephens ◽

Jeffrey M. Skerker ◽

Manraj S. Sekhon ◽

Adam P. Arkin ◽

Lee W. Riley

Keyword(s):

Escherichia Coli ◽

General Hospital ◽

San Francisco ◽

Antimicrobial Susceptibility ◽

Complete Genome ◽

Bloodstream Infections ◽

Sequence Type ◽

Genome Sequences ◽

Reference Sequences

Finished genome sequences are presented for four Escherichia coli strains isolated from bloodstream infections at San Francisco General Hospital. These strains provide reference sequences for four major fimH-identified sublineages within the multilocus sequence type (MLST) ST95 group, and provide insights into pathogenicity and differential antimicrobial susceptibility within this group.


ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

Bioinformatics ◽

10.1093/bioinformatics/btaa458 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i12-i20 ◽

Cited By ~ 2

Author(s):

Vitor C Piro ◽

Temesgen H Dadi ◽

Enrico Seiler ◽

Knut Reinert ◽

Bernhard Y Renard

Keyword(s):

State Of The Art ◽

Hierarchical Classification ◽

Bloom Filters ◽

Supplementary Information ◽

Sequence Classification ◽

Supplementary Data ◽

High Complexity ◽

Genome Sequences ◽

Reference Sequences ◽

Classification Tool

Motivation: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification often use outdated, and almost never truly up-to-date, indices. Results: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references and keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability and implementation: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. Supplementary information: Supplementary data are available at Bioinformatics online.
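The general idea of k-mer-based classification with Bloom filters and a k-mer counting threshold can be sketched as follows. This is a deliberately simplified illustration and not ganon's implementation: it uses one plain Bloom filter per reference group rather than an Interleaved Bloom Filter, it ignores taxonomy, and every name and parameter here (`BloomFilter`, `build_index`, `classify`, `min_fraction`, the filter size) is hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative; not an interleaved layout)."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(sequence, k):
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(references, k):
    """Map each reference group name to a Bloom filter of its k-mers."""
    index = {}
    for name, sequence in references.items():
        bloom = BloomFilter()
        for kmer in kmers(sequence, k):
            bloom.add(kmer)
        index[name] = bloom
    return index


def classify(read, index, k, min_fraction=0.8):
    """Assign the read to the group matching most of its k-mers, or None."""
    read_kmers = kmers(read, k)
    if not read_kmers or not index:
        return None
    counts = {name: sum(km in bloom for km in read_kmers) for name, bloom in index.items()}
    best = max(counts, key=counts.get)
    # Filtering step: require a minimum fraction of matching k-mers.
    return best if counts[best] / len(read_kmers) >= min_fraction else None
```

Because Bloom filters can report false positives but never false negatives, a threshold on the fraction of matching k-mers is one simple way to keep spurious hits from producing a classification.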


Novel functional sequences uncovered through a bovine multi-assembly graph

10.1101/2021.01.08.425845 ◽

2021 ◽

Author(s):

Danang Crysnanto ◽

Alexander S. Leonard ◽

Zih-Hua Fang ◽

Hubert Pausch

Keyword(s):

Genetic Diversity ◽

Reference Genome ◽

Differentially Expressed ◽

Reference Quality ◽

Taurine Cattle ◽

Novel Transcripts ◽

Reference Sequences ◽

Close Relatives ◽

Reference Genomes ◽

The Way

Linear reference genomes are typically assembled from single individuals. They are unable to reflect the genetic diversity of populations and lack millions of bases. To overcome such limitations and make non-reference sequences amenable to genetic investigations, we build a multi-assembly graph from six reference-quality assemblies from taurine cattle and their close relatives. We uncover 70,329,827 bases that are missing in the bovine linear reference genome. The missing sequences encode novel transcripts that are differentially expressed between individual animals. Reads which were previously poorly or unmapped against the bovine reference genome now align accurately to the non-reference sequences. We show that the non-reference sequences contain polymorphic sites that segregate within and between breeds of cattle. Our efforts to uncover novel functional sequences from a multi-assembly graph pave the way towards the transition to a more representative bovine reference genome.


Signatures of recent positive selection in enhancers across 41 human tissues

10.1101/534461 ◽

2019 ◽

Author(s):

Jiyun M. Moon ◽

John A. Capra ◽

Patrick Abbot ◽

Antonis Rokas

Keyword(s):

Positive Selection ◽

Evolutionary History ◽

Multiple Time ◽

Human Tissues ◽

Genome Sequences ◽

Evolutionary Time ◽

1000 Genomes ◽

Recent Positive Selection ◽

Evolutionary Changes ◽

Functional Context

Evolutionary changes in enhancers are widely associated with variation in human traits and diseases. However, studies comprehensively quantifying levels of selection on enhancers at multiple evolutionary time points during recent human evolution and how enhancer evolution varies across human tissues are lacking. To address these questions, we integrated a dataset of 41,561 transcribed enhancers active in 41 different human tissues (FANTOM Consortium) with whole genome sequences of 1,668 individuals from the African, Asian, and European populations (1000 Genomes Project). Our analyses based on four different metrics (Tajima’s D, FST, H12, nSL) showed that ~5.90% of enhancers considered showed evidence of recent positive selection and that genes associated with enhancers under positive selection are enriched for diverse immune-related functions. The distributions of these metrics for brain and testis enhancers were often statistically significantly different compared to those of other tissues; the same was true for brain and testis enhancers that are tissue-specific compared to those that are tissue-broad and for testis enhancers associated with tissue-enriched and non-tissue-enriched genes. These differences varied considerably across metrics and tissues and were generally due to changes in distributions’ shapes rather than shifts in their values. These results suggest that many human enhancers experienced recent positive selection throughout multiple time periods in human evolutionary history, that this selection occurred in a tissue-dependent and immune-related functional context, and that much like the evolution of their coding counterparts, the evolution of brain and testis enhancers has been markedly different from that of enhancers in other tissues.
