A Gene-Based Method for Cytogenetic Mapping of Repeat-Rich Mosquito Genomes

Insects ◽

10.3390/insects12020138 ◽

2021 ◽

Vol 12 (2) ◽

pp. 138

Author(s):

Reem A. Masri ◽

Dmitriy A. Karagodin ◽

Atashi Sharma ◽

Maria V. Sharakhova

Keyword(s):

Genome Mapping ◽

Chromosome Band ◽

Probe Design ◽

Cytogenetic Mapping ◽

Protein Coding ◽

Sequencing Technologies ◽

Mapping Approach ◽

Long Read ◽

New Gene ◽

Gene Transcripts

Long-read sequencing technologies have opened up new avenues of research on mosquito genome biology, enabling scientists to better understand the remarkable abilities of vectors for transmitting pathogens. Although new genome mapping technologies such as Hi-C scaffolding and optical mapping may significantly improve the quality of genomes, only cytogenetic mapping, with the help of fluorescence in situ hybridization (FISH), connects genomic scaffolds to a particular chromosome and chromosome band. This mapping approach is important for creating and validating chromosome-scale genome assemblies for mosquitoes with repeat-rich genomes, which can potentially be misassembled. In this study, we describe a new gene-based physical mapping approach that was optimized using the newly assembled Aedes albopictus genome, which is enriched with transposable elements. To avoid amplification of the repetitive DNA, 15 protein-coding gene transcripts were used for the probe design. Instead of using genomic DNA, complementary DNA was utilized as a template for development of the PCR-amplified probes for FISH. All probes were successfully amplified and mapped to specific chromosome bands. The genome-unique probes allowed unambiguous mapping of genomic scaffolds to chromosome regions. The method described in detail here can be used for physical genome mapping in other insects.


LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

Abstract Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to a S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.


Systematic analysis of dark and camouflaged genes: disease-relevant genes hiding in plain sight

10.1101/514497 ◽

2019 ◽

Cited By ~ 1

Author(s):

Mark T. W. Ebbert ◽

Tanner D. Jensen ◽

Karen Jansen-West ◽

Jonathon P. Sens ◽

Joseph S. Reddy ◽

...

Keyword(s):

Alzheimer's Disease ◽

Sequencing Data ◽

Systematic Analysis ◽

Protein Coding ◽

Short Read ◽

Sequencing Project ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read

Abstract Background The human genome contains ‘dark’ gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are ‘dark by depth’ (few mappable reads) and others that are ‘camouflaged’ (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer’s Disease Sequencing Project (ADSP; 13142 samples), as a proof of principle. Results Based on standard whole-genome Illumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2%, respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in only five ADSP cases and zero controls. Conclusions While we could not formally assess the CR1 frameshift mutation in Alzheimer’s disease (insufficient sample size), we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.


Towards complete and error-free genome assemblies of all vertebrate species

Nature ◽

10.1038/s41586-021-03451-0 ◽

2021 ◽

Vol 592 (7856) ◽

pp. 737-746 ◽

Cited By ~ 1

Author(s):

Arang Rhie ◽

Shane A. McCarthy ◽

Olivier Fedrigo ◽

Joana Damas ◽

Giulio Formenti ◽

...

Keyword(s):

Cost Effective ◽

Lessons Learned ◽

Vertebrate Species ◽

High Quality ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

Assembly Error ◽

Reference Genomes

Abstract High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.


Construction of a new chromosome-scale, long-read reference genome assembly of the Syrian hamster, Mesocricetus auratus

10.1101/2021.07.05.451071 ◽

2021 ◽

Author(s):

R. Alan Harris ◽

Muthuswamy Raveendran ◽

Dustin T Lyfoung ◽

Fritz J Sedlazeck ◽

Medhat Mahmoud ◽

...

Keyword(s):

Genome Assembly ◽

Syrian Hamster ◽

Reference Genome ◽

Sequence Data ◽

Mesocricetus Auratus ◽

Protein Coding ◽

Protein Coding Genes ◽

Sequencing Technologies ◽

Long Read ◽

Short Read Sequence

Background The Syrian hamster (Mesocricetus auratus) has been suggested as a useful mammalian model for a variety of diseases and infections, including infection with respiratory viruses such as SARS-CoV-2. The MesAur1.0 genome assembly was published in 2013 using whole-genome shotgun sequencing with short-read sequence data. More advanced sequencing technologies and assembly methods now permit the generation of near-complete genome assemblies with higher quality and higher continuity. Findings Here, we report an improved assembly of the M. auratus genome (BCM_Maur_2.0) using Oxford Nanopore Technologies long-read sequencing to produce a chromosome-scale assembly. The total length of the new assembly is 2.46 Gbp, similar to the 2.50 Gbp length of a previous assembly of this genome, MesAur1.0. BCM_Maur_2.0 exhibits significantly improved continuity with a scaffold N50 that is 6.7 times greater than that of MesAur1.0. Furthermore, 21,616 protein coding genes and 10,459 noncoding genes were annotated in BCM_Maur_2.0 compared to 20,495 protein coding genes and 4,168 noncoding genes in MesAur1.0. This new assembly also reduces the unresolved regions as measured by nucleotide ambiguities: approximately 17.11% of bases in MesAur1.0 were unresolved, compared to 3.00% in BCM_Maur_2.0. Conclusions Access to a more complete reference genome with improved accuracy and continuity will facilitate more detailed, comprehensive, and meaningful research results for a wide variety of future studies using Syrian hamsters as models.
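The scaffold N50 cited above is a continuity metric: the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. The following is a minimal illustrative sketch of that standard definition, not the authors' pipeline; the function name `n50` and the toy lengths are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length of the shortest scaffold in the smallest set of
    longest scaffolds that together cover at least half of the assembly.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)  # longest scaffolds first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (not data from the paper): total 30, half 15.
toy_scaffolds = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_scaffolds)
```

For the toy lengths, the two longest scaffolds (10 + 8 = 18) are the first to reach half of the 30 total bases, so `n50` returns 8. Comparing two assemblies by N50, as in the abstract, means computing this value for each set of scaffold lengths and taking the ratio.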


Mind the gaps – ignoring errors in long read assemblies critically affects protein prediction

10.1101/285049 ◽

2018 ◽

Cited By ~ 9

Author(s):

Mick Watson

Keyword(s):

Genome Sequencing ◽

Single Molecule ◽

Whole Genome ◽

Protein Coding ◽

Single Molecule Sequencing ◽

Truncated Protein ◽

Coding Regions ◽

Sequencing Technologies ◽

Protein Prediction ◽

Long Read

Long read, single molecule sequencing technologies are now routinely used for whole-genome sequencing and assembly. However, even after multiple rounds of correction, many errors remain which can critically affect protein coding regions, resulting in significantly altered and often truncated protein predictions.


Transposable element expression at unique loci in single cells with CELLO-seq

10.1101/2020.10.02.322073 ◽

2020 ◽

Author(s):

Rebecca V Berrens ◽

Andrian Yang ◽

Christopher E Laumer ◽

Aaron TL Lun ◽

Florian Bieberich ◽

...

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Single Cells ◽

Biological Processes ◽

Specific Expression ◽

Protein Coding ◽

Sequencing Technologies ◽

Repetitive Nature ◽

Long Read ◽

Induced Pluripotent

Abstract The role of Transposable Elements (TEs) in regulating diverse biological processes, from early development to cancer, is becoming increasingly appreciated. However, unlike other biological processes, next generation single-cell sequencing technologies are ill-suited for assaying TE expression: in particular, their highly repetitive nature means that short cDNA reads cannot be unambiguously mapped to a specific locus. Consequently, it is extremely challenging to understand the mechanisms by which TE expression is regulated and how they might themselves regulate other protein coding genes. To resolve this, we introduce CELLO-seq, a novel method and computational framework for performing long-read RNA sequencing at single cell resolution. CELLO-seq allows for full-length RNA sequencing and enables measurement of allelic, isoform and TE expression at unique loci. We use CELLO-seq to assess the widespread expression of TEs in 2-cell mouse blastomeres as well as human induced pluripotent stem cells (hiPSCs). Across both species, old and young TEs showed evidence of locus-specific expression, with simulations demonstrating that only a small number of very young elements in the mouse could not be mapped back with high confidence. Exploring the relationship between the expression of individual elements and putative regulators revealed surprising heterogeneity, with TEs within a class showing different patterns of correlation, suggesting distinct regulatory mechanisms.


Chromosome assembled and annotated genome sequence of Aspergillus flavus NRRL 3357

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab213 ◽

2021 ◽

Author(s):

Jeffrey M Skerker ◽

Kaila M Pianalto ◽

Stephen J Mondo ◽

Kunlong Yang ◽

Adam P Arkin ◽

...

Keyword(s):

Aspergillus Flavus ◽

Population Biology ◽

Draft Genome ◽

Model Organism ◽

Opportunistic Pathogen ◽

Protein Coding ◽

Oxford Nanopore ◽

Long Read ◽

New Gene ◽

Major Producer

Abstract Aspergillus flavus is an opportunistic pathogen of crops, including peanuts and maize, and is the second leading cause of aspergillosis in immunocompromised patients. A. flavus is also a major producer of the mycotoxin aflatoxin, a potent carcinogen, which results in significant crop losses annually. The A. flavus isolate NRRL 3357 was originally isolated from peanut and has been used as a model organism for understanding the regulation and production of secondary metabolites, such as aflatoxin. A draft genome of NRRL 3357 was previously constructed, enabling the development of molecular tools and an understanding of the population biology of this species. Here, we describe an updated, near complete, telomere-to-telomere assembly and re-annotation of the eight chromosomes of the A. flavus NRRL 3357 genome, accomplished via long-read PacBio and Oxford Nanopore technologies combined with Illumina short-read sequencing. A total of 13,715 protein-coding genes were predicted. Using RNA-seq data, a significant improvement was achieved in predicted 5’ and 3’ untranslated regions, which were incorporated into the new gene models.


Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome

Genome Biology ◽

10.1186/s13059-021-02369-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Robin-Lee Troskie ◽

Yohaann Jafrani ◽

Tim R. Mercer ◽

Adam D. Ewing ◽

Geoffrey J. Faulkner ◽

...

Keyword(s):

Cultured Cells ◽

Open Reading Frames ◽

cDNA Sequencing ◽

Protein Coding ◽

Dynamic Component ◽

Gene Copies ◽

Long Read ◽

Normal Human ◽

Reading Frames ◽

Transcriptional Landscape

Abstract Pseudogenes are gene copies presumed to mainly be functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we CRISPR-Cas9 delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.


Investigation of long non-coding RNAs as regulatory players of grapevine response to powdery and downy mildew infection

BMC Plant Biology ◽

10.1186/s12870-021-03059-6 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Garima Bhatia ◽

Santosh K. Upadhyay ◽

Anuradha Upadhyay ◽

Kashmir Singh

Keyword(s):

Downy Mildew ◽

Plasmopara Viticola ◽

Defense Responses ◽

Protein Coding ◽

Functional Roles ◽

Real Time Quantitative PCR ◽

Transcriptional Reprogramming ◽

Sequencing Technologies ◽

Non Coding RNAs ◽

Fungal Phytopathogens

Abstract Background Long non-coding RNAs (lncRNAs) are regulatory transcripts of length > 200 nt. Owing to the rapidly progressing RNA-sequencing technologies, lncRNAs are emerging as considerable nodes in the plant antifungal defense networks. Therefore, we investigated their role in Vitis vinifera (grapevine) in response to obligate biotrophic fungal phytopathogens, Erysiphe necator (powdery mildew, PM) and Plasmopara viticola (downy mildew, DM), which impose a huge agro-economic burden on grape-growers worldwide. Results Using a computational approach based on RNA-seq data, 71 PM- and 83 DM-responsive V. vinifera lncRNAs were identified and comprehensively examined for their putative functional roles in plant defense response. V. vinifera protein coding sequences (CDS) were also profiled based on expression levels, and 1037 PM-responsive and 670 DM-responsive CDS were identified. Next, co-expression analysis-based functional annotation revealed their association with gene ontology (GO) terms for ‘response to stress’, ‘response to biotic stimulus’, ‘immune system process’, etc. Further investigation based on analysis of domains, enzyme classification, pathways enrichment, transcription factors (TFs), interactions with microRNAs (miRNAs), and real-time quantitative PCR of lncRNAs and co-expressing CDS pairs suggested their involvement in modulation of basal and specific defense responses such as Ca2+-dependent signaling, cell wall reinforcement, reactive oxygen species metabolism, pathogenesis related proteins accumulation, phytohormonal signal transduction, and secondary metabolism. Conclusions Overall, the identified lncRNAs provide insights into the underlying intricacy of grapevine transcriptional reprogramming/post-transcriptional regulation to delay or seize the living cell-dependent pathogen growth. Therefore, in addition to defense-responsive genes such as TFs, the identified lncRNAs can be further examined and leveraged as candidates for biotechnological improvement/breeding to enhance fungal stress resistance in this susceptible fruit crop of economic and nutritional importance.


Genome sequences of human cytomegalovirus strain TB40/E variants propagated in fibroblasts and epithelial cells

Virology Journal ◽

10.1186/s12985-021-01583-3 ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Ahmed Al Qaffas ◽

Salvatore Camiolo ◽

Mai Vo ◽

Alexis Aguiar ◽

Amine Ourahmane ◽

...

Keyword(s):

Epithelial Cells ◽

Human Cytomegalovirus ◽

Viral Entry ◽

Sequence Data ◽

Laboratory Strain ◽

Serial Passage ◽

Wild Type Virus ◽

Protein Coding ◽

Genetic Changes ◽

Long Read

Abstract The advent of whole genome sequencing has revealed that common laboratory strains of human cytomegalovirus (HCMV) have major genetic deficiencies resulting from serial passage in fibroblasts. In particular, tropism for epithelial and endothelial cells is lost due to mutations disrupting genes UL128, UL130, or UL131A, which encode subunits of a virion-associated pentameric complex (PC) important for viral entry into these cells but not for entry into fibroblasts. The endothelial cell-adapted strain TB40/E has a relatively intact genome and has emerged as a laboratory strain that closely resembles wild-type virus. However, several heterogeneous TB40/E stocks and cloned variants exist that display a range of sequence and tropism properties. Here, we report the use of PacBio sequencing to elucidate the genetic changes that occurred, both at the consensus level and within subpopulations, upon passaging a TB40/E stock on ARPE-19 epithelial cells. The long-read data also facilitated examination of the linkage between mutations. Consistent with inefficient ARPE-19 cell entry, at least 83% of viral genomes present before adaptation contained changes impacting PC subunits. In contrast, and consistent with the importance of the PC for entry into endothelial and epithelial cells, genomes after adaptation lacked these or additional mutations impacting PC subunits. The sequence data also revealed six single noncoding substitutions in the inverted repeat regions, single nonsynonymous substitutions in genes UL26, UL69, US28, and UL122, and a frameshift truncating gene UL141. Among the changes affecting protein-coding regions, only the one in UL122 was strongly selected. This change, resulting in a D390H substitution in the encoded protein IE2, has been previously implicated in rendering another viral protein, UL84, essential for viral replication in fibroblasts. This finding suggests that IE2, and perhaps its interactions with UL84, have important functions unique to HCMV replication in epithelial cells.
