Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive

Mapping Intimacies ◽

10.1101/386441 ◽

2018 ◽

Cited By ~ 2

Author(s):

Brian Tsui ◽

Michelle Dow ◽

Dylan Skola ◽

Hannah Carter

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Rna Seq ◽

Dna And Rna ◽

Human Sequence ◽

Sequence Read Archive ◽

Sequencing Library ◽

Ncbi Dbsnp ◽

Data Worth ◽

Meta Analyses

The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants, which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP, where germline variants were detected in a median of 912 sequencing runs, and somatic variants were detected in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by WXS, suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.
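As a rough illustration of what an allelic read count is, the sketch below tallies how many aligned reads support the reference and alternate alleles at one known variant position. It is a toy example and not the authors' pipeline: the function name `allelic_read_counts`, the `(start, sequence)` read representation, and the gapless-alignment simplification are all assumptions made for illustration.

```python
from collections import Counter


def allelic_read_counts(reads, variant_pos, ref_allele, alt_allele):
    """Count reads supporting each allele at a 0-based reference position.

    Assumption: each read is a (start, sequence) pair aligned to the
    reference without gaps, so base i of the read sits at start + i.
    """
    counts = Counter()
    for start, seq in reads:
        offset = variant_pos - start
        # Skip reads that do not cover the variant position.
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    ref = counts[ref_allele]
    alt = counts[alt_allele]
    other = sum(counts.values()) - ref - alt
    return {"ref": ref, "alt": alt, "other": other}


# Hypothetical toy reads around a variant at position 4 (ref A, alt C).
toy_reads = [
    (0, "ACGTAC"),   # covers position 4 with A (reference)
    (2, "GTCCGG"),   # covers position 4 with C (alternate)
    (3, "TACGG"),    # covers position 4 with A (reference)
    (4, "CCGG"),     # covers position 4 with C (alternate)
    (4, "GCGG"),     # covers position 4 with G (neither allele)
    (10, "TTTT"),    # does not cover position 4
]
result = allelic_read_counts(toy_reads, variant_pos=4, ref_allele="A", alt_allele="C")
print(result)
```

Real alignments contain insertions, deletions, and soft clipping, so a production version would walk the alignment rather than index the read directly.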


anonymizeBAM: Versatile anonymization of human sequence data for open data sharing

10.1101/2021.01.11.426206 ◽

2021 ◽

Author(s):

Christoph Ziegenhain ◽

Rickard Sandberg

Keyword(s):

Data Sharing ◽

Genetic Variant ◽

Sequence Data ◽

Life Sciences ◽

Open Data ◽

Complete Removal ◽

Rna Seq ◽

Human Sequence ◽

Versatile Tool ◽

Variant Information

Abstract: The risks associated with re-identification of human genetic data are severely limiting open data sharing in life sciences. Here, we developed anonymizeBAM, a versatile tool for the anonymization of genetic variant information present in sequence data. Applying anonymizeBAM to single-cell RNA-seq and ATAC-seq datasets confirmed the complete removal of donor-related genetic information. Therefore, the accurate generation of de-identified sequence data will re-enable open sharing in sequencing-based studies for improved transparency, reproducibility, and innovation.


The neuropeptide Drosulfakinin regulates social isolation-induced aggression in Drosophila

10.1101/646232 ◽

2019 ◽

Author(s):

Pavan Agrawal ◽

Damian Kao ◽

Phuong Chung ◽

Loren L. Looger

Keyword(s):

Social Isolation ◽

Animal Behavior ◽

Sequence Data ◽

Fruit Fly ◽

Data Availability ◽

Rna Seq ◽

Animal Kingdom ◽

Gene Encoding ◽

Sequence Read Archive ◽

The Brain

Abstract: Social isolation strongly modulates behavior across the animal kingdom. We utilized the fruit fly Drosophila melanogaster to study social isolation-driven changes in animal behavior and gene expression in the brain. RNA-seq identified several head-expressed genes strongly responding to social isolation or enrichment. Of particular interest, social isolation downregulated expression of the gene encoding the neuropeptide Drosulfakinin (Dsk), the homologue of vertebrate cholecystokinin (CCK), which is critical for many mammalian social behaviors. Dsk knockdown significantly increased social isolation-induced aggression. Genetic activation or silencing of Dsk neurons each similarly increased isolation-driven aggression. Our results suggest a U-shaped dependence of social isolation-induced aggressive behavior on Dsk signaling, similar to the actions of many neuromodulators in other contexts. Data availability: The raw sequence data from RNA-seq experiments have been deposited into the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) with accession number PRJNA481582. Supplementary files and figures accompany this article.


A combined RNA-Seq and comparative genomics approach identifies 1,085 candidate structured RNAs expressed in human microbiomes

10.1101/2020.03.31.018887 ◽

2020 ◽

Cited By ~ 2

Author(s):

Brayon J. Fremin ◽

Ami S. Bhatt

Keyword(s):

Comparative Genomics ◽

Experimental Approach ◽

Genomic Sequence ◽

Sequence Data ◽

Human Microbiome ◽

Human Microbiome Project ◽

Computational Approach ◽

Rna Seq ◽

Rna Structures ◽

Experimental Approaches

Abstract: Structured RNAs play varied bioregulatory roles within microbes. To date, hundreds of candidate structured RNAs have been predicted using informatic approaches by searching for motif structures in genomic sequence data. However, only a subset of these candidate structured RNAs, those from culturable, well-studied microbes, have been shown to be transcribed. As the human microbiome contains thousands of species and strains of microbes, we sought to apply both informatic and experimental approaches to these organisms to identify novel transcribed structured RNAs. We combine an experimental approach, RNA-Seq, with an informatic approach, comparative genomics across the human microbiome project, to discover 1,085 candidate, conserved structured RNAs that are actively transcribed in human fecal microbiomes. These predictions include novel tracrRNAs that associate with Cas9 and RNA structures encoded in overlapping regions of the genome that are in opposing orientations. In summary, this combined experimental and computational approach enables the discovery of thousands of novel candidate structured RNAs.


Faculty Opinions recommendation of A likelihood ratio test of speciation with gene flow using genomic sequence data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.3540959.3240060 ◽

2010 ◽

Author(s):

Nicolas Galtier ◽

Julien Dutheil

Keyword(s):

Gene Flow ◽

Likelihood Ratio ◽

Likelihood Ratio Test ◽

Genomic Sequence ◽

Sequence Data ◽

Ratio Test


PoGB-Pred: Prediction of Antifreeze Proteins Sequences using Amino Acid Composition with Feature Selection followed by a Sequential based Ensemble Approach

Current Bioinformatics ◽

10.2174/1574893615999200707141926 ◽

2020 ◽

Vol 15 ◽

Author(s):

Affan Alim ◽

Abdul Rafay ◽

Imran Naseem

Keyword(s):

Amino Acid ◽

Dimension Reduction ◽

Protein Identification ◽

Cold Water ◽

Genomic Sequence ◽

Sequence Data ◽

Antifreeze Proteins ◽

Building Blocks ◽

Gradient Boosting ◽

Proposed Model

Background: Proteins contribute significantly to every task of cellular life. Their functions encompass the building and repairing of tissues in human bodies and other organisms; hence they are the building blocks of bones, muscles, cartilage, skin, and blood. Similarly, antifreeze proteins (AFPs) are of prime significance for organisms that live in very cold areas. With the help of these proteins, cold-water organisms can survive at sub-zero temperatures and resist the water crystallization process, which may rupture internal cells and tissues. AFPs have attracted attention and interest in the food industry and in cryopreservation. Objective: With the increasing availability of genomic sequence data for proteins, an automated and sophisticated tool for AFP recognition and identification is urgently needed. The sequences and structures of AFPs are highly distinct; therefore, most of the proposed methods fail to show promising results across different structures. A consolidated method is proposed to achieve competitive performance on highly distinct AFP structures. Methods: In this study, we propose to use the machine learning algorithms Principal Component Analysis (PCA) followed by Gradient Boosting (GB) for antifreeze protein identification. To analyze and validate the performance of the proposed model, various combinations of two-segment amino acid and dipeptide compositions are used. PCA is used for dimension reduction while retaining high variance in the data, and is followed by an ensemble method, gradient boosting, for modelling and classification. Results: The proposed method obtained superior performance on the PDB, Pfam and UniProt datasets compared with the RAFP-Pred method. In experiment-3, using only 150 PCA components, a high accuracy of 89.63 was achieved, which is superior to the 87.41 obtained using 300 significant features reported for the RAFP-Pred method. Experiment-2 was conducted using two different datasets, with non-AFPs from the PISCES server and AFPs from the Protein Data Bank. In experiment-2, our proposed method attained a high sensitivity of 79.16, which is 12.50 better than the state-of-the-art RAFP-Pred method. Conclusion: AFPs have a common function but distinct structures. Therefore, a single model developed for different sequences often fails on AFPs. Our proposed model has shown robust results on diverse training and testing datasets and outperformed the previous AFP prediction method RAFP-Pred. Our model consists of PCA for dimension reduction followed by gradient boosting for classification. Owing to its simplicity, scalability, and high performance, our model can be easily extended to the analysis of proteomic and genomic datasets.
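The amino acid and dipeptide composition features named in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the composition is computed over the whole sequence rather than the two-segment variant the abstract mentions, and the PCA and gradient boosting stages that would consume these vectors are omitted.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    seq = seq.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in the sequence (400 values)."""
    seq = seq.upper()
    counts = {}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1
    return [counts.get(dp, 0) / total for dp in DIPEPTIDES]


def feature_vector(seq):
    """Concatenate both compositions into one 420-dimensional vector."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    return amino_acid_composition(seq) + dipeptide_composition(seq)


# Hypothetical short peptide, used only to show the shape of the output.
features = feature_vector("MKTAAKA")
print(len(features))
```

Vectors of this kind, one per protein, would form the rows of the matrix passed to a dimension reduction step and then to a classifier.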


Special Issue: Genetic Basis of Phenotypic Variation in Drosophila and Other Insects

Genes ◽

10.3390/genes12081212 ◽

2021 ◽

Vol 12 (8) ◽

pp. 1212

Author(s):

J. Spencer Johnston ◽

Carl E. Hjelmen

Keyword(s):

Next Generation Sequencing ◽

Genetic Basis ◽

Genomic Sequence ◽

Sequence Data ◽

Complete Genomic Sequence ◽

Special Issue ◽

Model Species ◽

Road Map ◽

Generation Sequencing ◽

Complete Genomic

Next-generation sequencing provides a nearly complete genomic sequence for model and non-model species alike; however, this wealth of sequence data includes no road map [...]


Methyltransferase-directed orthogonal tagging and sequencing of miRNAs and bacterial small RNAs

BMC Biology ◽

10.1186/s12915-021-01053-w ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Milda Mickutė ◽

Kotryna Kvederavičiūtė ◽

Aleksandr Osipenko ◽

Raminta Mineikaitė ◽

Saulius Klimašauskas ◽

...

Keyword(s):

Rna Sequencing ◽

Regulatory Networks ◽

Library Preparation ◽

Rna Seq ◽

Basic Principles ◽

Cofactor Binding ◽

Sequencing Library ◽

Sequencing Library Preparation ◽

Target Rna

Abstract. Background: Targeted installation of designer chemical moieties on biopolymers provides an orthogonal means for their visualisation, manipulation and sequence analysis. Although high-throughput RNA sequencing is a widely used method for transcriptome analysis, certain steps, such as 3′ adapter ligation in strand-specific RNA sequencing, remain challenging due to structure- and sequence-related biases introduced by RNA ligases, leading to misrepresentation of particular RNA species. Here, we remedy this limitation by adapting two RNA 2′-O-methyltransferases from the Hen1 family for orthogonal chemo-enzymatic click tethering of a 3′ sequencing adapter that supports cDNA production by reverse transcription of the tagged RNA. Results: We showed that the ssRNA-specific DmHen1 and dsRNA-specific AtHEN1 can be used to efficiently append an oligonucleotide adapter to the 3′ end of target RNA for sequencing library preparation. Using this new chemo-enzymatic approach, we identified miRNAs and prokaryotic small non-coding sRNAs in probiotic Lactobacillus casei BL23. We found that compared to a reference conventional RNA library preparation, methyltransferase-Directed Orthogonal Tagging and RNA sequencing, mDOT-seq, avoids misdetection of unspecific highly-structured RNA species, thus providing better accuracy in identifying the groups of transcripts analysed. Our results suggest that mDOT-seq has the potential to advance analysis of eukaryotic and prokaryotic ssRNAs. Conclusions: Our findings provide a valuable resource for studies of the RNA-centred regulatory networks in Lactobacilli and pave the way to developing novel transcriptome and epitranscriptome profiling approaches in vitro and inside living cells. As RNA methyltransferases share the structure of the AdoMet-binding domain and several specific cofactor binding features, the basic principles of our approach could be easily translated to other AdoMet-dependent enzymes for the development of modification-specific RNA-seq techniques.


Discovery and assembly of repeat family pseudomolecules from sparse genomic sequence data using the Assisted Automated Assembler of Repeat Families (AAARF) algorithm

BMC Bioinformatics ◽

10.1186/1471-2105-9-235 ◽

2008 ◽

Vol 9 (1) ◽

pp. 235 ◽

Cited By ~ 22

Author(s):

Jeremy D DeBarry ◽

Renyi Liu ◽

Jeffrey L Bennetzen

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Repeat Family


Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.054171-0 ◽

2014 ◽

Vol 64 (Pt_2) ◽

pp. 316-324 ◽

Cited By ~ 258

Author(s):

Jongsik Chun ◽

Fred A. Rainey

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Original Research ◽

Rrna Gene ◽

New Taxon ◽

Genome Sequences ◽

Microbial World ◽

Content Type ◽

Link Type ◽

Type Strains

The polyphasic approach used today in the taxonomy and systematics of the Bacteria and Archaea includes the use of phenotypic, chemotaxonomic and genotypic data. The use of 16S rRNA gene sequence data has revolutionized our understanding of the microbial world and led to a rapid increase in the number of descriptions of novel taxa, especially at the species level. It has allowed in many cases for the demarcation of taxa into distinct species, but its limitations in a number of groups have resulted in the continued use of DNA–DNA hybridization. As technology has improved, next-generation sequencing (NGS) has provided a rapid and cost-effective approach to obtaining whole-genome sequences of microbial strains. Although some 12 000 bacterial or archaeal genome sequences are available for comparison, only 1725 of these are of actual type strains, limiting the use of genomic data in comparative taxonomic studies when there are nearly 11 000 type strains. Efforts to obtain complete genome sequences of all type strains are critical to the future of microbial systematics. The incorporation of genomics into the taxonomy and systematics of the Bacteria and Archaea coupled with computational advances will boost the credibility of taxonomy in the genomic era. This special issue of International Journal of Systematic and Evolutionary Microbiology contains both original research and review articles covering the use of genomic sequence data in microbial taxonomy and systematics. It includes contributions on specific taxa as well as outlines of approaches for incorporating genomics into new strain isolation to new taxon description workflows.


Characterization of nucleic acids from extracellular vesicle-enriched human sweat

BMC Genomics ◽

10.1186/s12864-021-07733-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Geneviève Bart ◽

Daniel Fischer ◽

Anatoliy Samoylenko ◽

Artem Zhyvolozhnyi ◽

Pavlo Stehantsev ◽

...

Keyword(s):

Nucleic Acids ◽

Human Genome ◽

Body Fluids ◽

Lower Percentage ◽

Rna Seq ◽

Protein Coding ◽

Human Sweat ◽

Dna And Rna ◽

Ribonucleoprotein Complexes ◽

Eccrine Glands

Abstract. Background: Human sweat is a mixture of secretions from three types of glands: eccrine, apocrine, and sebaceous. Eccrine glands open directly on the skin surface and produce high amounts of water-based fluid in response to heat, emotion, and physical activity, whereas the other glands produce oily fluids and waxy sebum. While most body fluids have been shown to contain nucleic acids, both as ribonucleoprotein complexes and associated with extracellular vesicles (EVs), these have not been investigated in sweat. In this study we aimed to explore and characterize the nucleic acids associated with sweat particles. Results: We used next generation sequencing (NGS) to characterize DNA and RNA in pooled and individual samples of EV-enriched sweat collected from volunteers performing rigorous exercise. In all sequenced samples, we identified DNA originating from all human chromosomes, but only the mitochondrial chromosome was highly represented with 100% coverage. Most of the DNA mapped to unannotated regions of the human genome, with some regions highly represented in all samples. Approximately 5% of the reads were found to map to other genomes, including bacteria (83%), archaea (3%), and viruses (13%); the identified bacterial species were consistent with those commonly colonizing the human upper body and arm skin. Small RNA-seq from EV-enriched pooled sweat RNA resulted in 74% of the trimmed reads mapping to the human genome, with 29% corresponding to unannotated regions. Over 70% of the RNA reads mapping to an annotated region were tRNA, while misc. RNA (18.5%), protein coding RNA (5%) and miRNA (1.85%) were much less represented. RNA-seq from individually processed EV-enriched sweat collections generally resulted in a lower percentage of reads mapping to the human genome (7–45%), with 50–60% of those reads mapping to unannotated regions of the genome, 30–55% being tRNAs, and lower percentages being rRNA, LincRNA, misc. RNA, and protein coding RNA. Conclusions: Our data demonstrate that sweat, like all other body fluids, contains a wealth of nucleic acids, including DNA and RNA of human and microbial origin, opening a possibility to investigate sweat as a source of biomarkers for specific health parameters.
