A binning tool to reconstruct viral haplotypes from assembled contigs

10.1101/704288 ◽

2019 ◽

Author(s):

Jiao Chen ◽

Jiayu Shang ◽

Jianrong Wang ◽

Yanni Sun

Keyword(s):

Genetic Diversity ◽

Rna Viruses ◽

Sequence Similarity ◽

Biological Properties ◽

Sequencing Data ◽

High Sequence Similarity ◽

Effective Prevention ◽

Next Generation Sequencing Technology ◽

Sequence Composition ◽

Genome Scale

Abstract Motivation Infections by RNA viruses such as influenza and HIV still pose a serious threat to human health despite extensive research on viral diseases. One challenge for producing effective prevention and treatment strategies is high intra-species genetic diversity. As different strains may have different biological properties, characterizing the genetic diversity is thus important to vaccine and drug design. Next-generation sequencing technology enables comprehensive characterization of both known and novel strains and has been widely adopted for sequencing viral populations. However, genome-scale reconstruction of haplotypes is still a challenging problem. In particular, haplotype assembly programs often produce contigs rather than full genomes. As a mutation in one gene can mask the phenotypic effects of a mutation at another locus, clustering these contigs into genome-scale haplotypes is still needed. Results We developed a contig binning tool, VirBin, which clusters contigs into different groups so that each group represents a haplotype. Commonly used features based on sequence composition and contig coverage cannot effectively distinguish viral haplotypes because of their high sequence similarity and heterogeneous sequencing coverage for RNA viruses. VirBin applied prototype-based clustering to cluster regions that are more likely to contain mutations specific to a haplotype. The tool was tested on multiple simulated sequencing data sets with different haplotype abundance distributions and contig sizes, and also on mock quasispecies sequencing data. The benchmark results with other contig binning tools demonstrated the superior sensitivity and precision of VirBin in contig binning for viral haplotype reconstruction. Availability https://github.com/chjiao/[email protected]

A binning tool to reconstruct viral haplotypes from assembled contigs

BMC Bioinformatics ◽

10.1186/s12859-019-3138-1 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Jiao Chen ◽

Jiayu Shang ◽

Jianrong Wang ◽

Yanni Sun

Keyword(s):

Genetic Diversity ◽

Rna Viruses ◽

Sequence Similarity ◽

Biological Properties ◽

Sequencing Data ◽

High Sequence Similarity ◽

Effective Prevention ◽

Next Generation Sequencing Technology ◽

Sequence Composition ◽

Genome Scale

Abstract Background Infections by RNA viruses such as influenza and HIV still pose a serious threat to human health despite extensive research on viral diseases. One challenge for producing effective prevention and treatment strategies is high intra-species genetic diversity. As different strains may have different biological properties, characterizing the genetic diversity is thus important to vaccine and drug design. Next-generation sequencing technology enables comprehensive characterization of both known and novel strains and has been widely adopted for sequencing viral populations. However, genome-scale reconstruction of haplotypes is still a challenging problem. In particular, haplotype assembly programs often produce contigs rather than full genomes. As a mutation in one gene can mask the phenotypic effects of a mutation at another locus, clustering these contigs into genome-scale haplotypes is still needed. Results We developed a contig binning tool, VirBin, which clusters contigs into different groups so that each group represents a haplotype. Commonly used features based on sequence composition and contig coverage cannot effectively distinguish viral haplotypes because of their high sequence similarity and heterogeneous sequencing coverage for RNA viruses. VirBin applied prototype-based clustering to cluster regions that are more likely to contain mutations specific to a haplotype. The tool was tested on multiple simulated sequencing data sets with different haplotype abundance distributions and contig sizes, and also on mock quasispecies sequencing data. The benchmark results with other contig binning tools demonstrated the superior sensitivity and precision of VirBin in contig binning for viral haplotype reconstruction. Conclusions In this work, we presented VirBin, a new contig binning tool for distinguishing contigs from different viral haplotypes with high sequence similarity. It competes favorably with other tools on viral contig binning. The source code is available at https://github.com/chjiao/VirBin.

Efficient CRISPR/Cas9 mediated Pooled-sgRNAs assembly accelerates targeting multiple genes related to male sterility in cotton

Plant Methods ◽

10.1186/s13007-021-00712-x ◽

2021 ◽

Vol 17 (1) ◽

Author(s):

Mohamed Ramadan ◽

Muna Alariqi ◽

Yizan Ma ◽

Yanlong Li ◽

Zhenping Liu ◽

...

Keyword(s):

Genetic Transformation ◽

Sequence Similarity ◽

High Specificity ◽

Transformation Method ◽

High Sequence Similarity ◽

Single Experiment ◽

Next Generation Sequencing Technology ◽

Cotton Transformation ◽

Assembly Method ◽

Multiple Genes

Abstract Background Upland cotton (Gossypium hirsutum) harbors a complex allotetraploid genome consisting of A and D sub-genomes. Every gene has multiple copies with high sequence similarity, which makes genetic, genomic and functional analyses extremely challenging. The recent accessibility of the CRISPR/Cas9 tool provides the ability to modify targeted loci efficiently in various complicated plant genomes. However, the current cotton transformation method targeting one gene requires a complicated, long and laborious regeneration process. Hence, optimizing a strategy that targets multiple genes is of great value in cotton functional genomics and genetic engineering. Results To target multiple genes in a single experiment, 112 plant development-related genes were knocked out via an optimized CRISPR/Cas9 system. We optimized the key steps of a pooled sgRNAs assembly method by which 116 sgRNAs were pooled together into 4 groups (each group consisted of 29 sgRNAs). Each group of sgRNAs was compiled in one PCR reaction which subsequently went through one round of vector construction, transformation, sgRNAs identification and also one round of genetic transformation. Through Agrobacterium-mediated genetic transformation, we successfully generated more than 800 plants. For mutant identification, Next Generation Sequencing technology was used, and the results showed that all generated plants were positive and all targeted genes were covered. Interestingly, among all the transgenic plants, 85% harbored a single sgRNA insertion, 9% two insertions, 3% three different sgRNA insertions, and 2.5% mutated sgRNAs. These plants with different targeted sgRNAs exhibited numerous combinations of phenotypes in plant flowering tissues. Conclusion All targeted genes were successfully edited with high specificity. Our pooled sgRNAs assembly offers a simple, fast and efficient method/strategy to target multiple genes at one time and accelerates the study of gene function in cotton.

The Complete Chloroplast Genome of the Vulnerable Oreocharis esquirolii (Gesneriaceae): Structural Features, Comparative and Phylogenetic Analysis

Plants ◽

10.3390/plants9121692 ◽

2020 ◽

Vol 9 (12) ◽

pp. 1692

Author(s):

Li Gu ◽

Ting Su ◽

Ming-Tai An ◽

Guo-Xiong Hu

Keyword(s):

Phylogenetic Analysis ◽

Sequence Similarity ◽

Single Copy ◽

Structural Features ◽

Rrna Genes ◽

Trna Genes ◽

Sequencing Data ◽

High Sequence Similarity ◽

Plastid Genomes ◽

Cp Genome

Oreocharis esquirolii, a member of Gesneriaceae, is also known as Thamnocharis esquirolii, which has been regarded as a synonym of the former. The species is endemic to Guizhou, southwestern China, and is evaluated as vulnerable (VU) under the International Union for Conservation of Nature (IUCN) criteria. Until now, the sequence and genome information of O. esquirolii has remained unknown. In this study, we assembled and characterized the complete chloroplast (cp) genome of O. esquirolii using Illumina sequencing data for the first time. The total length of the cp genome was 154,069 bp with a typical quadripartite structure consisting of a pair of inverted repeats (IRs) of 25,392 bp separated by a large single copy region (LSC) of 85,156 bp and a small single copy region (SSC) of 18,129 bp. The genome comprised 114 unique genes with 80 protein-coding genes, 30 tRNA genes, and four rRNA genes. Thirty-one repeat sequences and 74 simple sequence repeats (SSRs) were identified. Genome alignment across five plastid genomes of Gesneriaceae indicated a high sequence similarity. Four highly variable sites (rps16-trnQ, trnS-trnG, ndhF-rpl32, and ycf1) were identified. Phylogenetic analysis indicated that O. esquirolii grouped together with O. mileensis, supporting resurrection of the name Oreocharis esquirolii from Thamnocharis esquirolii. The complete cp genome sequence will contribute to further studies in molecular identification, genetic diversity, and phylogeny.
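
The region lengths quoted in this abstract can be cross-checked with simple arithmetic, since a quadripartite plastid genome is the sum of the LSC, the SSC, and two IR copies. The short Python sketch below uses only the numbers given in the abstract; the variable and function names are illustrative and not from the paper.

```python
# Region lengths in base pairs, as quoted in the abstract.
LSC = 85_156             # large single copy region
SSC = 18_129             # small single copy region
IR = 25_392              # one inverted repeat; the genome carries two copies
TOTAL_REPORTED = 154_069

# Gene counts by class, as quoted in the abstract.
GENE_COUNTS = {"protein_coding": 80, "tRNA": 30, "rRNA": 4}
UNIQUE_GENES_REPORTED = 114


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a quadripartite genome: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


computed_total = quadripartite_length(LSC, SSC, IR)
computed_genes = sum(GENE_COUNTS.values())

if __name__ == "__main__":
    print("length consistent:", computed_total == TOTAL_REPORTED)
    print("gene count consistent:", computed_genes == UNIQUE_GENES_REPORTED)
```

Both checks agree with the reported totals: 85,156 + 18,129 + 2 × 25,392 = 154,069 bp, and 80 + 30 + 4 = 114 genes.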

Insights into Transcriptional Repression of the Homologous Toxin-Antitoxin Cassettes yefM-yoeB and axe-txe

International Journal of Molecular Sciences ◽

10.3390/ijms21239062 ◽

2020 ◽

Vol 21 (23) ◽

pp. 9062

Author(s):

Barbara Kędzierska ◽

Katarzyna Potrykus ◽

Agnieszka Szalewska-Pałasz ◽

Beata Wodzikowska

Keyword(s):

Transcriptional Repression ◽

Transcription Initiation ◽

Sequence Similarity ◽

In Vitro Transcription ◽

High Sequence Similarity ◽

Sequence Composition ◽

Repressor Complex ◽

Transcriptional Fusions

Transcriptional repression is a mechanism that enables gene expression to be switched off effectively. The activity of most type II toxin-antitoxin (TA) cassettes is controlled in this way. These cassettes undergo negative autoregulation by the TA protein complex, which binds to the promoter/operator sequence and blocks transcription initiation of the TA operon. Precise and tight control of this process is vital to avoid uncontrolled expression of the toxin component. Here, we employed a series of in vivo and in vitro experiments to establish the molecular basis for previously observed differences in transcriptional activity and repression levels of the pyy and pat promoters, which control expression of two homologous TA systems, YefM-YoeB and Axe-Txe, respectively. Transcriptional fusions of promoters with a lux reporter, together with in vitro transcription, EMSA and footprinting assays, revealed that: (1) the different sequence composition of the −35 promoter element is responsible for substantial divergence in strengths of the promoters; (2) variations in repression result from the TA repressor complex acting at different steps in the transcription initiation process; (3) transcription from an additional promoter upstream of pat also contributes to the observed inefficient repression of the axe-txe module. This study provides evidence that even closely related TA cassettes with high sequence similarity in the promoter/operator region may employ diverse mechanisms for transcriptional regulation of their genes.

NeoRdRp: A comprehensive dataset for identifying RNA-dependent RNA polymerase of various RNA viruses from metatranscriptomic data

10.1101/2021.12.31.474423 ◽

2022 ◽

Author(s):

Shoichi Sakaguchi ◽

Syun-ichi Urayama ◽

Yoshihiro Takaki ◽

Hong Wu ◽

Youichi Suzuki ◽

...

Keyword(s):

Rna Polymerase ◽

Rna Viruses ◽

Rna Virus ◽

Sequence Similarity ◽

Virus Detection ◽

Detection Methods ◽

Amino Acid Sequence Similarity ◽

Sequencing Data ◽

Rna Dependent Rna Polymerase ◽

Multiple Sequence

RNA viruses are distributed in various environments, and most RNA viruses have been recently identified by metatranscriptome sequencing. However, due to the high nucleotide diversity of RNA viruses, it is still challenging to identify their sequences. Therefore, this study generated a dataset of RNA-dependent RNA polymerase (RdRp) domains essential for all RNA viruses belonging to Orthornavirae. Also, the collected genes with RdRp domains from various RNA viruses were clustered by amino acid sequence similarity. For each cluster, a multiple sequence alignment was generated, and a hidden Markov model (HMM) profile was created if the number of sequences was greater than five. Using the 1,467 HMM profiles, we detected RdRp domains in the RefSeq RNA virus sequences, combined the hit sequences with the RdRp domains, and reconstructed the HMM profiles. As a result, 2,234 HMM profiles were generated from 12,316 RdRp domain sequences, and the dataset was named NeoRdRp. Additionally, using the UniProt dataset, we confirmed that almost all NeoRdRp HMM profiles could specifically detect RdRps in Orthornavirae. Furthermore, we compared the NeoRdRp dataset with two previously reported RNA virus detection methods to detect RNA virus sequences from metatranscriptome sequencing data. Our methods can identify most of the RNA viruses in the datasets; however, some RNA viruses were not detected, similar to the other two methods. The NeoRdRp can be improved by repeatedly adding new RdRp sequences and can be expected to be widely applied as a system for detecting various RNA viruses from metatranscriptome data.
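
The profile-building rule stated in this abstract (one HMM profile per cluster, but only when the cluster holds more than five sequences) can be illustrated with a minimal Python sketch. The abstract does not name the clustering, alignment, or HMM-building tools, so the code covers only the size filter; the function name, cluster identifiers, and toy data are hypothetical.

```python
from typing import Dict, List

# Per the abstract, a profile is created only if a cluster has MORE than five sequences.
MIN_SEQUENCES = 5


def clusters_eligible_for_hmm(
    clusters: Dict[str, List[str]],
    min_sequences: int = MIN_SEQUENCES,
) -> List[str]:
    """Return the IDs of clusters large enough to receive an HMM profile."""
    return sorted(
        cluster_id
        for cluster_id, sequences in clusters.items()
        if len(sequences) > min_sequences
    )


# Hypothetical toy clusters; the strings stand in for RdRp domain sequences.
toy_clusters = {
    "c1": ["SEQ"] * 6,    # 6 sequences: eligible
    "c2": ["SEQ"] * 5,    # exactly 5: not eligible, the rule is "greater than five"
    "c3": ["SEQ"] * 12,   # 12 sequences: eligible
}

if __name__ == "__main__":
    print(clusters_eligible_for_hmm(toy_clusters))
```

The strict inequality matters at the boundary: a cluster with exactly five sequences is skipped under the rule as worded in the abstract.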

A pan-cancer landscape of somatic substitutions in non-unique regions of the human genome

10.1101/2020.04.14.040634 ◽

2020 ◽

Author(s):

Maxime Tarabichi ◽

Jonas Demeulemeester ◽

Annelien Verfaillie ◽

Adrienne M. Flanagan ◽

Peter Van Loo ◽

...

Keyword(s):

Human Genome ◽

Sequence Similarity ◽

Gene Families ◽

Regulatory Elements ◽

Cancer Genes ◽

Mutation Load ◽

Sequencing Data ◽

High Sequence Similarity ◽

Computational Analyses ◽

Pan Cancer

Abstract Around 13% of the human genome displays high sequence similarity with at least one other chromosomal position and thereby poses challenges for computational analyses such as detection of somatic events in cancer. We here extract features of sequencing data from across non-unique regions and employ a machine learning pipeline to describe a landscape of somatic substitutions in 2,658 cancers from the PCAWG cohort. We show mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation load and substitution profiles, and can be validated with linked-read sequencing. This uncovers hidden mutations in ~1,700 coding sequences and thousands of regulatory elements, including known cancer genes, immunoglobulins, and highly mutated gene families.

Efficient CRISPR-Cas9 Mediated Pooled-sgRNAs Assembly Accelerates Targeting Multiple Genes Related to Male Sterility in Cotton

10.21203/rs.3.rs-107438/v1 ◽

2020 ◽

Author(s):

Mohamed Ramadan ◽

Muna Alariqi ◽

Yizan Ma ◽

Yanlong Li ◽

Zhenping Liu ◽

...

Keyword(s):

Genetic Transformation ◽

Sequence Similarity ◽

High Specificity ◽

Transformation Method ◽

High Sequence Similarity ◽

Single Experiment ◽

Next Generation Sequencing Technology ◽

Practical Applications ◽

Assembly Method ◽

Multiple Genes

Abstract Background: Upland cotton (Gossypium hirsutum) harbors a complex allotetraploid genome consisting of A and D subgenomes. The genes have multiple copies with high sequence similarity, which makes genetic, genomic and functional analysis extremely challenging. The recent accessibility of the CRISPR/Cas9 tool offers the ability to modify targeted loci efficiently in various complicated plant genomes. However, the current cotton transformation method targeting one gene requires a complicated, long and laborious regeneration process. Hence, optimizing a strategy to target multiple genes is of great value in cotton functional genomics and practical applications of genetic engineering. Results: To target multiple genes in a single experiment, 112 plant development-related genes were knocked out via an optimized CRISPR-Cas9 system. We optimized the key steps of a pooled sgRNAs assembly method by which 116 sgRNAs were pooled together into 4 groups (each group consisted of 29 sgRNAs). Each group of sgRNAs was compiled in one PCR reaction which subsequently went through one round of vector construction, transformation, sgRNAs identification and also one round of genetic transformation. Through Agrobacterium-mediated genetic transformation, we successfully generated more than 800 plants. For mutant identification, Next Generation Sequencing technology was used, and the results showed that all generated plants were positive and all targeted genes were covered. Interestingly, among all the transgenic plants, 85% harbored a single sgRNA insertion, 9% two insertions, 3% three different sgRNA insertions, and 2.5% mutated sgRNAs. These plants with different targeted sgRNAs exhibited numerous combinations of phenotypes in plant flowering tissues. Conclusion: All targeted genes were successfully edited with high specificity, which makes our pooled sgRNAs assembly a simple, fast and efficient method/strategy to target multiple genes at one time and accelerates the study of gene function in cotton.

Accurate and Efficient KIR Gene and Haplotype Inference from Genome Sequencing Reads with Novel K-mer Signatures

10.1101/541938 ◽

2019 ◽

Cited By ~ 5

Author(s):

David Roe ◽

Rui Kuang

Keyword(s):

Genome Sequencing ◽

Sequence Similarity ◽

Killer Cell ◽

Haplotype Pair ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

High Sequence Similarity ◽

Data Set ◽

Kir Genes ◽

Kir Gene

Abstract The killer cell immunoglobulin-like receptor (KIR) proteins evolve to fight viruses and mediate the body’s reaction to pregnancy. These roles provide selection pressure for variation at both the structural/haplotype and base/allele levels. At the same time, the genes have evolved relatively recently by tandem duplication and therefore exhibit very high sequence similarity over thousands of bases. These variation-homology patterns make it impossible to interpret KIR haplotypes from abundant short-read genome sequencing data at population scale using existing methods. Here, we developed an efficient computational approach for in silico KIR probe interpretation (KPI) to accurately interpret an individual’s KIR genes and haplotype-pairs from KIR sequencing reads. We designed synthetic 25-base sequence probes by analyzing previously reported haplotype sequences, and we developed a bioinformatics pipeline to interpret the probes in the context of 16 KIR genes and 16 haplotype structures. We demonstrated its accuracy on a synthetic data set as well as real whole-genome sequences from 748 individuals from The Genome of the Netherlands (GoNL). The GoNL predictions were compared with SNP-based predictions. Our results show a 100% accuracy rate for the synthetic tests and a 99.6% family-consistency rate in the GoNL tests. Agreement with the SNP-based calls on KIR genes ranges from 72-100% with a mean of 92%; most differences occur in genes KIR2DS2, KIR2DL2, KIR2DS3, and KIR2DL5 where KPI predicts presence and the SNP-based interpretation predicts absence. Overall, the evidence suggests that KPI’s accuracy is 97% or greater for both KIR gene and haplotype-pair predictions, although the presence/absence genotyping leads to ambiguous haplotype-pair predictions with 16 reference KIR haplotype structures. KPI is free, open, and easily executable as a Nextflow workflow supported by a Docker environment at https://github.com/droeatumn/kpi.

Accurate and Efficient KIR Gene and Haplotype Inference From Genome Sequencing Reads With Novel K-mer Signatures

Frontiers in Immunology ◽

10.3389/fimmu.2020.583013 ◽

2020 ◽

Vol 11 ◽

Author(s):

David Roe ◽

Rui Kuang

Keyword(s):

Genome Sequencing ◽

Sequence Similarity ◽

Killer Cell ◽

Haplotype Pair ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

High Sequence Similarity ◽

Data Set ◽

Kir Genes ◽

Kir Gene

The killer-cell immunoglobulin-like receptor (KIR) proteins evolve to fight viruses and mediate the body’s reaction to pregnancy. These roles provide selection pressure for variation at both the structural/haplotype and base/allele levels. At the same time, the genes have evolved relatively recently by tandem duplication and therefore exhibit very high sequence similarity over thousands of bases. These variation-homology patterns make it impossible to interpret KIR haplotypes from abundant short-read genome sequencing data at population scale using existing methods. Here, we developed an efficient computational approach for in silico KIR probe interpretation (KPI) to accurately interpret an individual’s KIR genes and haplotype-pairs from KIR sequencing reads. We designed synthetic 25-base sequence probes by analyzing previously reported haplotype sequences, and we developed a bioinformatics pipeline to interpret the probes in the context of 16 KIR genes and 16 haplotype structures. We demonstrated its accuracy on a synthetic data set as well as real whole-genome sequences from 748 individuals from The Genome of the Netherlands (GoNL). The GoNL predictions were compared with SNP-based predictions. Our results show a 100% accuracy rate for the synthetic tests and a 99.6% family-consistency rate in the GoNL tests. Agreement with the SNP-based calls on KIR genes ranges from 72%–100% with a mean of 92%; most differences occur in genes KIR2DS2, KIR2DL2, KIR2DS3, and KIR2DL5 where KPI predicts presence and the SNP-based interpretation predicts absence. Overall, the evidence suggests that KPI’s accuracy is 97% or greater for both KIR gene and haplotype-pair predictions, and the presence/absence genotyping leads to ambiguous haplotype-pair predictions with 16 reference KIR haplotype structures. KPI is free, open, and easily executable as a Nextflow workflow supported by a Docker environment at https://github.com/droeatumn/kpi.
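
The general idea of presence/absence genotyping with fixed-length sequence probes can be sketched in a few lines of Python. This is not the KPI pipeline: the abstract does not describe how probe hits are combined into a gene call, so the rule used here (a gene is called present when a chosen fraction of its probes occurs in the reads, on either strand) is an assumption, and the gene names, probes, and reads are hypothetical toy data.

```python
from typing import Dict, Iterable, List, Set

PROBE_LENGTH = 25  # probe length quoted in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_kmers(reads: Iterable[str], k: int = PROBE_LENGTH) -> Set[str]:
    """Collect every k-base substring found in the reads."""
    kmers: Set[str] = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def call_genes(
    reads: Iterable[str],
    probes: Dict[str, List[str]],
    min_fraction: float = 1.0,  # assumed decision rule, not taken from the paper
) -> Dict[str, bool]:
    """Call each gene present if enough of its probes match the reads."""
    kmers = read_kmers(reads)
    calls: Dict[str, bool] = {}
    for gene, gene_probes in probes.items():
        hits = sum(
            1 for p in gene_probes
            if p in kmers or reverse_complement(p) in kmers
        )
        calls[gene] = bool(gene_probes) and hits / len(gene_probes) >= min_fraction
    return calls


# Hypothetical 25-base probes and reads for illustration only.
probe_a = "ACGT" * 6 + "A"
probe_b = "TTTTTGGGGGCCCCCAAAAATTTTT"
toy_probes = {"geneX": [probe_a], "geneY": [probe_b], "geneZ": ["C" * 25]}
toy_reads = ["GG" + probe_a + "CC", "AA" + reverse_complement(probe_b) + "TT"]

if __name__ == "__main__":
    print(call_genes(toy_reads, toy_probes))
```

In the toy data, geneX matches on the forward strand, geneY matches only through its reverse complement, and geneZ has no matching read, which shows why both strands need to be checked when probes are compared against sequencing reads.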

High Genetic Diversity and Adaptive Potential of Two Simian Hemorrhagic Fever Viruses in a Wild Primate Population

10.1101/001040 ◽

2013 ◽

Author(s):

Adam L. Bailey ◽

Michael Lauck ◽

Andrea Weiler ◽

Samuel D. Sibley ◽

Jorge M. Dinis ◽

...

Keyword(s):

Genetic Diversity ◽

Rna Viruses ◽

Hemorrhagic Fever ◽

Biological Properties ◽

Open Reading Frames ◽

Red Colobus ◽

High Genetic Diversity ◽

Simian Hemorrhagic Fever Virus ◽

Hemorrhagic Fever Virus ◽

Natural Hosts

Key biological properties such as high genetic diversity and high evolutionary rate enhance the potential of certain RNA viruses to adapt and emerge. Identifying viruses with these properties in their natural hosts could dramatically improve disease forecasting and surveillance. Recently, we discovered two novel members of the viral family Arteriviridae: simian hemorrhagic fever virus (SHFV)-krc1 and SHFV-krc2, infecting a single wild red colobus (Procolobus rufomitratus tephrosceles) in Kibale National Park, Uganda. Nearly nothing is known about the biological properties of SHFVs in nature, although the SHFV type strain, SHFV-LVR, has caused devastating outbreaks of viral hemorrhagic fever in captive macaques. Here we detected SHFV-krc1 and SHFV-krc2 in 40% and 47% of 60 wild red colobus tested, respectively. We found viral loads in excess of 1x10^6-1x10^7 RNA copies per milliliter of blood plasma for each of these viruses. SHFV-krc1 and SHFV-krc2 also showed high genetic diversity at both the inter- and intra-host levels. Analyses of synonymous and non-synonymous nucleotide diversity across viral genomes revealed patterns suggestive of positive selection in SHFV open reading frames (ORF) 5 (SHFV-krc2 only) and 7 (SHFV-krc1 and SHFV-krc2). Thus, these viruses share several important properties with some of the most rapidly evolving, emergent RNA viruses.
