PHASIS: A computational suite for de novo discovery and characterization of phased, siRNA-generating loci and their miRNA triggers

Mapping Intimacies ◽

10.1101/158832 ◽

2017 ◽

Cited By ~ 7

Author(s):

Atul Kakrana ◽

Pingchuan Li ◽

Parth Patel ◽

Reza Hammond ◽

Deepti Anand ◽

...

Keyword(s):

De Novo ◽

Sequencing Data ◽

Protein Coding ◽

Secondary Sirnas ◽

Integrated Methods ◽

Non Coding Rnas

Phased, secondary siRNAs (phasiRNAs) are found widely in plants, from protein-coding transcripts and long, non-coding RNAs; animal piRNAs are also phased. Integrated methods characterizing “PHAS” loci are unavailable, and existing methods are quite limited and inefficient in handling large volumes of sequencing data. The PHASIS suite described here provides complete tools for the computational characterization of PHAS loci, with an emphasis on plants, in which these loci are numerous. Benchmarked comparisons demonstrate that PHASIS is sensitive, highly scalable and fast. Importantly, PHASIS eliminates the requirement of a sequenced genome and PARE/degradome data for discovery of phasiRNAs and their miRNA triggers.

CALINCA—A Novel Pipeline for the Identification of lncRNAs in Podocyte Disease

Cells ◽

10.3390/cells10030692 ◽

2021 ◽

Vol 10 (3) ◽

pp. 692

Author(s):

Sweta Talyan ◽

Samantha Filipów ◽

Michael Ignarski ◽

Magdalena Smieszek ◽

He Chen ◽

...

Keyword(s):

Cell Biology ◽

Mammalian Cells ◽

De Novo ◽

Depth Information ◽

Gene Products ◽

Classical Analysis ◽

Protein Coding ◽

Bioinformatic Pipeline ◽

Non Coding Rnas ◽

Filtration Unit

Diseases of the renal filtration unit (the glomerulus) are the most common cause of chronic kidney disease. Podocytes are the pivotal cell type for the function of this filter, and focal-segmental glomerulosclerosis (FSGS) is a classic example of a podocytopathy leading to proteinuria and glomerular scarring. Currently, no targeted treatment of FSGS is available. This lack of therapeutic strategies is explained by a limited understanding of the defects in podocyte cell biology leading to FSGS. To date, most studies in the field have focused on protein-coding genes and their gene products. However, more than 80% of all transcripts produced by mammalian cells are actually non-coding. Long non-coding RNAs (lncRNAs) are a relatively novel class of transcripts and have not been systematically studied in FSGS to date. Appropriate tools to facilitate lncRNA research for the renal scientific community are urgently required, because lncRNA analysis poses a number of challenges compared to classical analysis pipelines optimized for coding RNA expression analysis. Here, we present the bioinformatic pipeline CALINCA as a solution to this problem. CALINCA automatically analyzes datasets from murine FSGS models and quantifies both annotated and de novo assembled lncRNAs. In addition, the tool provides in-depth information on podocyte specificity of these lncRNAs, as well as evolutionary conservation and expression in human datasets, making this pipeline a crucial basis for lncRNA studies in FSGS.

Genome-Wide Identification and Characterization of Long Non-Coding RNAs in Peanut

Genes ◽

10.3390/genes10070536 ◽

2019 ◽

Vol 10 (7) ◽

pp. 536 ◽

Cited By ~ 2

Author(s):

Xiaobo Zhao ◽

Liming Gan ◽

Caixia Yan ◽

Chunjuan Li ◽

Quanxi Sun ◽

...

Keyword(s):

Large Scale ◽

Target Genes ◽

Sequencing Data ◽

Regulatory Processes ◽

Genome Wide ◽

Non Coding Rnas ◽

Identification And Characterization ◽

Lower Expression ◽

Weighted Correlation

Long non-coding RNAs (lncRNAs) are involved in various regulatory processes although they do not encode protein. Presently, there is little information regarding the identification of lncRNAs in peanut (Arachis hypogaea Linn.). In this study, 50,873 lncRNAs of peanut were identified from large-scale published RNA sequencing data that belonged to 124 samples involving 15 different tissues. The average lengths of lncRNA and mRNA were 4335 bp and 954 bp, respectively. Compared to the mRNAs, the lncRNAs were shorter, with fewer exons and lower expression levels. The 4713 co-expression lncRNAs (expressed in all samples) were used to construct co-expression networks by using the weighted correlation network analysis (WGCNA). LncRNAs correlating with the growth and development of different peanut tissues were obtained, and target genes for 386 hub lncRNAs of all lncRNAs co-expressions were predicted. Taken together, these findings can provide a comprehensive identification of lncRNAs in peanut.

Characterization of testis-specific LINC01016 gene reveals isoform-specific roles in controlling biological processes

Journal of the Endocrine Society ◽

10.1210/jendso/bvab153 ◽

2021 ◽

Author(s):

Enrique I Ramos ◽

Barbara Yang ◽

Yasmin M Vasquez ◽

Ken Y Lin ◽

Ramesh Choudhari ◽

...

Keyword(s):

De Novo ◽

Biological Processes ◽

Detailed Characterization ◽

Aberrant Expression ◽

Protein Coding ◽

Lncrna Gene ◽

Gene Expression Analyses ◽

Uterine Cancers ◽

Exon Usage

Long noncoding RNAs (lncRNAs) have emerged as critical regulators of biological processes. However, the aberrant expression of an isoform from the same lncRNA gene could lead to RNA with altered functions due to changes in conformation, leading to disease. Here, we describe a detailed characterization of the gene which encodes long intergenic non-protein coding RNA 01016 (LINC01016, a.k.a. LncRNA1195) with a focus on its structure, exon usage, and expression in human and macaque tissues. In this study, we show that it is among the highly expressed lncRNAs in the testis, is exclusively conserved among non-human primates, suggesting its recent evolution, and is expressed and processed into 12 distinct RNAs in testis, cervix, and uterus tissues. Further, we integrate de novo annotation of expressed LINC01016 transcripts and isoform-dependent gene expression analyses to show that human LINC01016 is a multi-exon gene, processed through differential exon usage with isoform-specific roles. Furthermore, in cervical, testicular, and uterine cancers, LINC01016 isoforms are differentially expressed, and their expression is predictive of survival in these cancers. The study has revealed an essential aspect of lncRNA biology that is rarely associated with coding RNAs: lncRNA genes are precisely processed to generate isoforms with distinct biological roles in specific tissues.

Germline mosaicism of a missense variant in KCNC2 in a multiplex family with autism and epilepsy

10.1101/2021.12.06.21264306 ◽

2021 ◽

Author(s):

Elvisa Mehinovic ◽

Teddi Gray ◽

Meghan Campbell ◽

Jenny Ekholm ◽

Aaron Wenger ◽

...

Keyword(s):

De Novo ◽

Copy Number Variants ◽

Missense Variant ◽

Missense Mutations ◽

Sequencing Data ◽

Multiplex Family ◽

Protein Coding ◽

Germline Mosaicism ◽

Current Decay ◽

Long Read

Currently, protein-coding de novo variants and large copy number variants have been identified as important for ∼30% of individuals with autism. One approach to identify relevant variation in individuals who lack these types of events is by utilizing newer genomic technologies. In this study, highly accurate PacBio HiFi long-read sequencing was applied to a family with autism, treatment-refractory epilepsy, cognitive impairment, and mild dysmorphic features (two affected female full siblings, parents, and one unaffected sibling) with no known clinical variant. From our long-read sequencing data, a de novo missense variant in the KCNC2 gene (encodes Kv3.2 protein) was identified in both affected children. This variant was phased to the paternal chromosome of origin and is likely a germline mosaic. In silico assessment of the variant revealed it was in the top 0.05% of all conserved bases in the genome, and was predicted damaging by Polyphen2, MutationTaster, and SIFT. It was not present in any controls from public genome databases nor in a joint-call set we generated across 49 individuals with publicly available PacBio HiFi data. This specific missense mutation (Val473Ala) has been shown in both an ortholog and paralog of Kv3.2 to accelerate current decay, shift the voltage dependence of activation, and prevent the channel from entering a long-lasting open state. Seven additional missense mutations have been identified in other individuals with neurodevelopmental disorders (p = 1.03 × 10⁻⁵). KCNC2 is most highly expressed in the brain; in particular, in the thalamus and is enriched in GABAergic neurons. Long-read sequencing was useful in discovering the relevant variant in this family with autism that had remained a mystery for several years and will potentially have great benefits in the clinic once it is widely available.

Fact or fiction: updates on how protein-coding genes might emerge de novo from previously non-coding DNA

F1000Research ◽

10.12688/f1000research.10079.1 ◽

2017 ◽

Vol 6 ◽

pp. 57 ◽

Cited By ~ 23

Author(s):

Jonathan F Schmitz ◽

Erich Bornberg-Bauer

Keyword(s):

De Novo ◽

Protein Structures ◽

Divergence Times ◽

Protein Coding ◽

Future Studies ◽

Functional Studies ◽

Protein Coding Genes ◽

Open Questions ◽

Bona Fide

Over the last few years, there has been an increasing amount of evidence for the de novo emergence of protein-coding genes, i.e. out of non-coding DNA. Here, we review the current literature and summarize the state of the field. We focus specifically on open questions and challenges in the study of de novo protein-coding genes such as the identification and verification of de novo-emerged genes. The greatest obstacle to date is the lack of high-quality genomic data with very short divergence times which could help precisely pin down the location of origin of a de novo gene. We conclude that, while there is plenty of evidence from a genetics perspective, there is a lack of functional studies of bona fide de novo genes and almost no knowledge about protein structures and how they come about during the emergence of de novo protein-coding genes. We suggest that future studies should concentrate on the functional and structural characterization of de novo protein-coding genes as well as the detailed study of the emergence of functional de novo protein-coding genes.

Faculty Opinions recommendation of Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.717960736.793463655 ◽

2012 ◽

Author(s):

François Cambien

Keyword(s):

De Novo ◽

Protein Coding ◽

Protein Coding Genes ◽

Non Coding Rnas

Structural and functional characterization of a putative de novo gene in Drosophila

Nature Communications ◽

10.1038/s41467-021-21667-6 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Andreas Lange ◽

Prajal H. Patel ◽

Brennen Heames ◽

Adam M. Damry ◽

Thorsten Saenger ◽

...

Keyword(s):

De Novo ◽

Functional Characterization ◽

Comparative Genomic ◽

Noncoding Dna ◽

Protein Coding ◽

Ancestral Sequences ◽

De Novo Gene ◽

Genomic Studies ◽

Biochemical Genetic

Comparative genomic studies have repeatedly shown that new protein-coding genes can emerge de novo from noncoding DNA. Still unknown is how and when the structures of encoded de novo proteins emerge and evolve. Combining biochemical, genetic and evolutionary analyses, we elucidate the function and structure of goddard, a gene which appears to have evolved de novo at least 50 million years ago within the Drosophila genus. Previous studies found that goddard is required for male fertility. Here, we show that Goddard protein localizes to elongating sperm axonemes and that in its absence, elongated spermatids fail to undergo individualization. Combining modelling, NMR and circular dichroism (CD) data, we show that Goddard protein contains a large central α-helix, but is otherwise partially disordered. We find similar results for Goddard’s orthologs from divergent fly species and their reconstructed ancestral sequences. Accordingly, Goddard’s structure appears to have been maintained with only minor changes over millions of years.

Assembly of the Mitochondrial Genome in the Campanulaceae Family Using Illumina Low-Coverage Sequencing

Genes ◽

10.3390/genes9080383 ◽

2018 ◽

Vol 9 (8) ◽

pp. 383 ◽

Cited By ~ 2

Author(s):

Hyun-Oh Lee ◽

Ji-Weon Choi ◽

Jeong-Ho Baek ◽

Jae-Hyeon Oh ◽

Sang-Choon Lee ◽

...

Keyword(s):

Mitochondrial Genome ◽

Dna Sequencing ◽

De Novo ◽

Genome Structure ◽

Sequencing Data ◽

Phylogenetic Characterization ◽

Low Coverage ◽

Paired End Sequencing ◽

Circular Chromosomes

Platycodon grandiflorus (balloon flower) and Codonopsis lanceolata (bonnet bellflower) are important herbs used in Asian traditional medicine, and both belong to the botanical family Campanulaceae. In this study, we designed and implemented a de novo DNA sequencing and assembly strategy to map the complete mitochondrial genomes of the first two members of the Campanulaceae using low-coverage Illumina DNA sequencing data. We produced a total of 28.9 Gb of paired-end sequencing data from the genomic DNA of P. grandiflorus (20.9 Gb) and C. lanceolata (8.0 Gb). The assembled mitochondrial genome of P. grandiflorus was found to consist of two circular chromosomes; the master circle contains 56 genes, and the minor circle contains 42 genes. The C. lanceolata mitochondrial genome consists of a single circle harboring 54 genes. Using a comparative genome structure and a pattern of repeated sequences, we show that the P. grandiflorus minor circle resulted from a recombination event involving the direct repeats of the master circle. Our dataset will be useful for comparative genomics and for evolutionary studies, and will facilitate further biological and phylogenetic characterization of species in the Campanulaceae.

Systematic and computational identification of Androctonus crassicauda long non-coding RNAs

Scientific Reports ◽

10.1038/s41598-021-83815-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Fatemeh Salabi ◽

Hedieh Jafari ◽

Shahrokh Navidpour ◽

Ayeh Sadat Sadr

Keyword(s):

Gc Content ◽

Sequencing Data ◽

Closely Related Species ◽

Protein Coding ◽

Model Animal ◽

Lower Protein ◽

Non Coding Rnas ◽

Reference Genomes ◽

Per Gene

The potential function of long non-coding RNAs in regulating neighboring protein-coding genes has attracted scientists’ attention. Despite the important role of lncRNAs in biological processes, a limited number of studies focus on non-model animal lncRNAs. In this study, we used a stringent step-by-step filtering pipeline and machine learning-based tools to identify the specific Androctonus crassicauda lncRNAs and analyze the features of predicted scorpion lncRNAs. Using this pipeline, 13,401 lncRNAs were detected in the A. crassicauda transcriptome. The BLAST results indicated that the majority of these lncRNA sequences (12,642) have no identifiable orthologs even in closely related species and are therefore considered novel lncRNAs. Comparison with lncRNA prediction tools indicated that our pipeline is a helpful approach to distinguish protein-coding and non-coding transcripts from RNA sequencing data of species without reference genomes. Moreover, analyzing lncRNA characteristics in A. crassicauda revealed that lower protein-coding potential, lower GC content, shorter transcript length, and fewer isoforms per gene are outstanding features of A. crassicauda lncRNA transcripts.

Extreme purifying selection against point mutations in the human genome

10.1101/2021.08.23.457339 ◽

2021 ◽

Author(s):

Noah Dukler ◽

Mehreen R Mughal ◽

Ritika Ramani ◽

Yi-Fei Huang ◽

Adam Siepel

Keyword(s):

Human Genome ◽

De Novo ◽

Point Mutations ◽

Purifying Selection ◽

Selection Coefficient ◽

Sequencing Data ◽

Protein Coding ◽

Coding Regions ◽

Protein Coding Genes ◽

Selective Effects

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious. Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
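
The "fractional depletion" idea in the abstract above can be illustrated with a naive calculation. This is only a sketch and not the ExtRaINSIGHT method itself: it ignores the mutation-rate and neighbor-dependence corrections the abstract mentions, and the function name, parameter names, and counts are all hypothetical.

```python
def fractional_depletion(target_rare, target_sites, neutral_rare, neutral_sites):
    """Naive depletion estimate: 1 - (rare-variant density in target / density in neutral).

    Illustrative only. The published method additionally controls for local
    variation and neighbor-dependence in mutation rate, which is omitted here.
    """
    if target_sites <= 0 or neutral_sites <= 0:
        raise ValueError("site counts must be positive")
    if neutral_rare <= 0:
        raise ValueError("neutral set must contain at least one rare variant")
    target_rate = target_rare / target_sites      # rare variants per target site
    neutral_rate = neutral_rare / neutral_sites   # rare variants per matched neutral site
    return 1.0 - target_rate / neutral_rate


if __name__ == "__main__":
    # Made-up counts, chosen only to show the arithmetic.
    lam = fractional_depletion(target_rare=300, target_sites=10_000,
                               neutral_rare=500, neutral_sites=10_000)
    print(f"naive depletion estimate: {lam:.3f}")
```

A value near 0 would mean the target set carries rare variants at roughly the neutral rate, while values closer to 1 would mean rare variants are strongly depleted.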
