Rethomics: an R framework to analyse high-throughput behavioural data

Mapping Intimacies ◽

10.1101/305664 ◽

2018 ◽

Author(s):

Quentin Geissmann ◽

Luis Garcia Rodriguez ◽

Esteban J. Beckwith ◽

Giorgio F. Gilestro

Keyword(s):

High Throughput ◽

Link Type ◽

Complex Phenotypes ◽

Extensive Documentation ◽

Behavioural Biology ◽

Computational Solution

Abstract: The recent development of automatised methods to score various behaviours on a large number of animals provides biologists with an unprecedented set of tools to decipher these complex phenotypes. Analysing such data comes with several challenges that are largely shared across acquisition platforms and paradigms. Here, we present rethomics, a set of R packages that unifies the analysis of behavioural datasets in an efficient and flexible manner. rethomics offers a computational solution to storing, manipulating and visualising large amounts of behavioural data. We propose it as a tool to bridge the gap between behavioural biology and data sciences, thus connecting computational and behavioural scientists. rethomics comes with extensive documentation as well as a set of both practical and theoretical tutorials (available at https://rethomics.github.io).


kataegis: an R package for identification and visualization of the genomic localized hypermutation regions using high-throughput sequencing

BMC Genomics ◽

10.1186/s12864-021-07696-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xue Lin ◽

Yingying Hua ◽

Shuanglin Gu ◽

Li Lv ◽

Xingyu Li ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Somatic Mutations ◽

R Package ◽

Frequency Of Occurrence ◽

Link Type ◽

Genomic Landscape ◽

One Step ◽

Flanking Regions

Abstract. Background: Genomic localized hypermutation regions have been found in cancers and are reported to be related to cancer prognosis. This genomic localized hypermutation differs markedly from ordinary somatic mutations in both frequency of occurrence and genomic density. It resembles a “violent storm” of mutations, which is what the Greek word “kataegis” means. Results: There is a need for a lightweight and simple-to-use toolkit to identify and visualize localized hypermutation regions in the genome. We therefore developed the R package “kataegis” to meet this need. The package uses only three steps to identify genomic hypermutation regions: i) read in the variation files in standard formats; ii) calculate the inter-mutational distances; iii) identify the hypermutation regions with appropriate parameters. One further step visualizes the nucleotide contents and spectra of both the foci and flanking regions, and the genomic landscape of these regions. Conclusions: The kataegis package is available on Bioconductor/GitHub (https://github.com/flosalbizziae/kataegis) and provides a lightweight and simple-to-use toolkit for quickly identifying and visualizing genomic hypermutation regions.
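The three identification steps described in the abstract can be sketched in a few lines of Python. This is an illustration only and not the kataegis package's API: the function names, the `(chrom, pos)` input format, and the thresholds `max_gap=1000` and `min_mutations=6` are assumptions chosen for the example, and the clustering rule (consecutive mutations separated by at most `max_gap` bases) is a simplification of whatever criteria the package applies.

```python
from collections import defaultdict


def intermutation_distances(variants):
    """Group (chrom, pos) variants by chromosome and return, for each
    chromosome, a list of (position, distance_to_previous_mutation)."""
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)
    distances = {}
    for chrom, positions in by_chrom.items():
        positions.sort()
        distances[chrom] = [
            (cur, cur - prev) for prev, cur in zip(positions, positions[1:])
        ]
    return distances


def hypermutation_regions(positions, max_gap=1000, min_mutations=6):
    """Return (start, end, n_mutations) for runs of consecutive mutations
    whose neighbouring gaps are all <= max_gap (hypothetical thresholds)."""
    positions = sorted(positions)
    regions = []
    run = positions[:1]
    for pos in positions[1:]:
        if pos - run[-1] <= max_gap:
            run.append(pos)  # still inside the current dense run
        else:
            if len(run) >= min_mutations:
                regions.append((run[0], run[-1], len(run)))
            run = [pos]  # start a new run at this mutation
    if len(run) >= min_mutations:
        regions.append((run[0], run[-1], len(run)))
    return regions


# Toy input: six closely spaced mutations on chr1, then two isolated ones.
chr1_positions = [100, 300, 450, 900, 1200, 1500, 50000, 90000]
variants = [("chr1", p) for p in chr1_positions] + [("chr2", 10), ("chr2", 500000)]

distances = intermutation_distances(variants)
regions = hypermutation_regions(chr1_positions)
print(distances["chr1"][:3])
print(regions)
```

In practice the variant positions would come from a VCF or MAF file rather than a hard-coded list, and the thresholds would be tuned to the data.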


Describing Delicate Interactions

Science s STKE ◽

10.1126/stke.3692007tw24 ◽

2007 ◽

Vol 2007 (369) ◽

pp. tw24-tw24

Author(s):

Valda Vinson

Keyword(s):

Transcription Factors ◽

Binding Energy ◽

Dna Binding ◽

High Throughput ◽

Biological Networks ◽

Full Text ◽

Systems Approach ◽

Energy Landscapes ◽

Eukaryotic Transcription ◽

Link Type

Quantifying the affinities of interactions in biological networks, particularly transient ones, remains a challenge. Maerkl and Quake describe a high-throughput microfluidic platform that allows the measurement of transient and low-affinity interactions and characterize the DNA binding energy landscapes for four eukaryotic transcription factors. In two cases, the binding specificities were used to predict which genes the transcription factors would bind and likely regulate.

S. J. Maerkl, S. R. Quake, A systems approach to measuring the binding energy landscapes of transcription factors. Science 315, 233-237 (2007).


ATAC-seq with unique molecular identifiers improves quantification and footprinting

Communications Biology ◽

10.1038/s42003-020-01403-4 ◽

2020 ◽

Vol 3 (1) ◽

Author(s):

Tao Zhu ◽

Keyan Liao ◽

Rongfang Zhou ◽

Chunjiao Xia ◽

Weibo Xie

Keyword(s):

Transcription Factor ◽

High Throughput ◽

High Throughput Sequencing ◽

Pcr Amplification ◽

Chromatin Accessibility ◽

Link Type ◽

Pcr Duplicates ◽

Identify Transcription Factor ◽

Accessible Chromatin ◽

Accurate Quantification

Abstract: ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) provides an efficient way to analyze nucleosome-free regions and has been applied widely to identify transcription factor footprints. Both applications rely on the accurate quantification of insertion events of the hyperactive transposase Tn5. However, because of PCR amplification, it is impossible to accurately distinguish independently generated identical Tn5 insertion events from PCR duplicates using the standard ATAC-seq technique. Removing PCR duplicates based on mapping coordinates introduces increasing bias towards highly accessible chromatin regions. To overcome this limitation, we establish a UMI-ATAC-seq technique by incorporating unique molecular identifiers (UMIs) into standard ATAC-seq procedures. In our study, UMI-ATAC-seq can rescue about 20% of reads that are mistaken for PCR duplicates in standard ATAC-seq. We demonstrate that UMI-ATAC-seq can more accurately quantify chromatin accessibility and significantly improve the sensitivity of identifying transcription factor footprints. An analytic pipeline has been developed to facilitate the application of UMI-ATAC-seq, and it is available at https://github.com/tzhu-bio/UMI-ATAC-seq.
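The difference between coordinate-based and UMI-aware duplicate removal can be illustrated with a minimal Python sketch. It is not the UMI-ATAC-seq pipeline: the read tuples, the UMI strings, and the function name `count_unique_fragments` are made up for the example, and real tools also handle sequencing errors within UMIs, which this sketch ignores.

```python
def count_unique_fragments(reads, use_umi):
    """Count fragments kept after duplicate removal.

    reads   : iterable of (chrom, start, end, umi) tuples (toy format)
    use_umi : if False, reads sharing mapping coordinates collapse into one;
              if True, they collapse only when the UMI is also identical.
    """
    seen = set()
    for chrom, start, end, umi in reads:
        key = (chrom, start, end, umi) if use_umi else (chrom, start, end)
        seen.add(key)
    return len(seen)


# Toy data: the first two reads are a true PCR duplicate pair (same UMI);
# the third shares coordinates but has a different UMI, i.e. it stands for
# an independent Tn5 insertion event at the same position.
reads = [
    ("chr1", 100, 250, "AAC"),
    ("chr1", 100, 250, "AAC"),
    ("chr1", 100, 250, "GTT"),
    ("chr1", 400, 520, "CCA"),
]

by_coordinates = count_unique_fragments(reads, use_umi=False)
by_umi = count_unique_fragments(reads, use_umi=True)
print(by_coordinates, by_umi)
```

Coordinate-only deduplication discards the independent insertion in this toy set, while the UMI-aware version keeps it. This is the effect the abstract describes as rescuing reads mistaken for PCR duplicates.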


Statistical Approaches to Gene X Environment Interactions for Complex Phenotypes

10.7551/mitpress/9780262034685.001.0001 ◽

2016 ◽

Cited By ~ 2

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Randomized Clinical Trials ◽

Partial Least Square ◽

Least Square ◽

Complex Phenotypes ◽

Genes And Environment ◽

Polygenic Scores ◽

Research Designs ◽

Statistical Approaches

Findings from the Human Genome Project and from Genome-Wide Association (GWA) studies indicate that many diseases and traits manifest a more complex genomic pattern than previously assumed. These findings, and advances in high-throughput sequencing, suggest that there are many sources of influence: genetic, epigenetic, and environmental. This volume investigates the role of the interactions of genes and environment (G × E) in diseases and traits (referred to by the contributors as complex phenotypes) including depression, diabetes, obesity, and substance use. The contributors first present different statistical approaches or strategies to address G × E and G × G interactions with high-throughput sequenced data, including two-stage procedures to identify G × E and G × G interactions, marker-set approaches to assessing interactions at the gene level, and the use of a partial-least square (PLS) approach. The contributors then turn to specific complex phenotypes, research designs, or combined methods that may advance the study of G × E interactions, considering such topics as randomized clinical trials in obesity research, longitudinal research designs and statistical models, and the development of polygenic scores to investigate G × E interactions.

Contributors: Fatima Umber Ahmed, Yin-Hsiu Chen, James Y. Dai, Caroline Y. Doyle, Zihuai He, Li Hsu, Shuo Jiao, Erin Loraine Kinnally, Yi-An Ko, Charles Kooperberg, Seunggeun Lee, Arnab Maity, Jeanne M. McCaffery, Bhramar Mukherjee, Sung Kyun Park, Duncan C. Thomas, Alexandre Todorov, Jung-Ying Tzeng, Tao Wang, Michael Windle, Min Zhang


Phigaro: high throughput prophage sequence annotation

10.1101/598243 ◽

2019 ◽

Cited By ~ 6

Author(s):

Elizaveta V. Starikova ◽

Polina O. Tikhonova ◽

Nikita A. Prianichnikov ◽

Chris M. Rands ◽

Evgeny M. Zdobnov ◽

...

Keyword(s):

Test Data ◽

High Throughput ◽

Source Code ◽

Sequence Annotation ◽

Command Line ◽

Link Type ◽

Genome Maps ◽

Transposon Insertion ◽

Prophage Sequence

Abstract. Summary: Phigaro is a standalone command-line application that is able to detect prophage regions taking raw genome and metagenome assemblies as an input. It also produces dynamic annotated “prophage genome maps” and marks possible transposon insertion spots inside prophages. It provides putative taxonomic annotations that can distinguish tailed from non-tailed phages. It is applicable for mining prophage regions from large metagenomic datasets. Availability: Source code for Phigaro is freely available for download at https://github.com/bobeobibo/phigaro along with test data. The code is written in Python.


DIVERSITY in binding, regulation, and evolution revealed from high-throughput ChIP

10.1101/122325 ◽

2017 ◽

Author(s):

Sneha Mitra ◽

Anushua Biswas ◽

Leelavati Narlikar

Keyword(s):

High Throughput ◽

Dna Sequences ◽

Chromatin Immunoprecipitation ◽

Direct Contact ◽

Black Box ◽

Cross Linking ◽

Link Type ◽

Chip Experiment ◽

Long Range Interactions ◽

Regulatory Functions

Abstract: A high-throughput chromatin immunoprecipitation (ChIP) experiment is like a black box: it reports all regions that are associated with the profiled protein based on the initial cross-linking step. These regions can be a highly diverse set of DNA sequences, with some making direct contact with the protein, some binding through intermediaries, and some being a result of long-range interactions involving the protein. We present DIVERSITY, a method that identifies the distinct components of such a mixture, leaving no data behind while using no prior motif knowledge. Using the example of the REST protein, we show that these different components give insights into the various complexes that may be forming along the chromatin and their regulatory functions. Availability: http://diversity.ncl.res.in/ (webserver); https://github.com/NarlikarLab/DIVERSITY (standalone for Mac OSX/Linux).


MethylExtract: High-Quality methylation maps and SNV calling from whole genome bisulfite sequencing data

F1000Research ◽

10.12688/f1000research.2-217.v2 ◽

2014 ◽

Vol 2 ◽

pp. 217 ◽

Cited By ~ 8

Author(s):

Guillermo Barturen ◽

Antonio Rueda ◽

José L. Oliver ◽

Michael Hackenberg

Keyword(s):

High Throughput ◽

Sequence Variation ◽

High Throughput Sequencing ◽

Whole Genome ◽

Single Nucleotide Variants ◽

High Quality ◽

Single Nucleotide ◽

Error Sources ◽

Link Type ◽

Genome Methylation

Whole genome methylation profiling at single-cytosine resolution is now feasible due to the advent of high-throughput sequencing techniques together with bisulfite treatment of the DNA. To obtain the methylation value of each individual cytosine, the bisulfite-treated sequence reads are first aligned to a reference genome, and the methylation levels are then profiled from the alignments. A huge effort has been made to align the reads quickly and correctly, and many different algorithms and programs have been created for this purpose. The second step is just as crucial and non-trivial, yet much less attention has been paid to the final inference of the methylation states. Important error sources exist, such as sequencing errors, bisulfite failure, clonal reads, and single nucleotide variants.

We developed MethylExtract, a user-friendly tool to: i) generate high-quality, whole genome methylation maps and ii) detect sequence variation within the same sample preparation. The program is implemented in a single script and takes into account all major error sources. MethylExtract detects variation (SNVs, single nucleotide variants) in a similar way to VarScan, a very sensitive method extensively used in SNV and genotype calling based on non-bisulfite-treated reads. The usefulness of MethylExtract is shown by means of extensive benchmarking based on artificial bisulfite-treated reads and a comparison to a recently published method called Bis-SNP.

MethylExtract is able to detect SNVs within high-throughput sequencing experiments of bisulfite-treated DNA at the same time as it generates high-quality methylation maps. This simultaneous detection of DNA methylation and sequence variation is crucial for many downstream analyses, for example when deciphering the impact of SNVs on differential methylation. An exclusive feature of MethylExtract, in comparison with existing software, is the possibility to assess bisulfite failure statistically. The source code, tutorial and artificial bisulfite datasets are available at http://bioinfo2.ugr.es/MethylExtract/ and http://sourceforge.net/projects/methylextract/, and are also permanently accessible from 10.5281/zenodo.7144.
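The second step, inferring a methylation value for each cytosine from the alignments, rests on a simple ratio: after bisulfite treatment, an unmethylated cytosine is read as T while a methylated one is still read as C. The Python sketch below illustrates only that ratio. It is not MethylExtract's implementation; the function name and the `min_coverage` parameter are hypothetical, and none of the error sources listed in the abstract are modelled.

```python
def methylation_level(c_count, t_count, min_coverage=1):
    """Estimate the methylation level at one reference cytosine.

    c_count      : reads showing C at this position (read as methylated)
    t_count      : reads showing T at this position (read as unmethylated)
    min_coverage : hypothetical filter; positions covered by fewer
                   informative reads return None instead of a ratio
    """
    coverage = c_count + t_count
    if coverage < min_coverage or coverage == 0:
        return None
    return c_count / coverage


# 18 of 20 informative reads keep the C, so the estimate is 0.9.
well_covered = methylation_level(18, 2)
# Two reads are too few under a minimum coverage of 5.
poorly_covered = methylation_level(1, 1, min_coverage=5)
print(well_covered, poorly_covered)
```

A SNV that turns a reference C into a T in the sample would be indistinguishable from an unmethylated cytosine under this ratio alone, which is one reason the abstract stresses detecting sequence variation alongside methylation.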


DIAFree enables untargeted open-search identification for Data-Independent Acquisition data

10.1101/2020.08.30.274209 ◽

2020 ◽

Author(s):

Iris Xu

Keyword(s):

Data Processing ◽

High Throughput ◽

Protein Analysis ◽

Processing Algorithm ◽

Software Suite ◽

Data Processing Algorithm ◽

Link Type ◽

Data Independent Acquisition ◽

Spectral Libraries

Abstract: As a reliable and high-throughput proteomics strategy, data-independent acquisition (DIA) has shown great potential for protein analysis. However, DIA also imposes stress on the data processing algorithm by generating complex multiplexed spectra. Traditionally, DIA data are processed using spectral libraries refined from experiment histories, which requires stable experimental conditions and additional runs. Furthermore, scientists still need to use library-free tools to generate spectral libraries from additional runs. To lessen those burdens, we present DIAFree (https://github.com/xuesu/DIAFree), a library-free, tag-index-based software suite that enables both restricted search and open search on DIA data using the information of MS1 scans in a precursor-centric and spectrum-centric style. We validate the quality of detection using publicly available data. We further evaluate the quality of spectral libraries produced by DIAFree.


jFuzzyMachine – An Open–source Fuzzy Logic–based Regulatory Inference Engine for High–throughput Biological Data

10.1101/2020.10.06.315994 ◽

2020 ◽

Author(s):

Paul Aiyetan

Keyword(s):

Fuzzy Logic ◽

High Throughput ◽

Regulatory Network ◽

Network Inference ◽

Hct116 Cell ◽

Inference Engine ◽

Biological Data ◽

Inference System ◽

Apparent Lack ◽

Link Type

Abstract: Elucidating mechanistic relationships between and among intracellular macromolecules is fundamental to understanding the molecular basis of normal and diseased processes. Here, we introduce jFuzzyMachine, a fuzzy logic-based regulatory network inference engine for high-throughput biological data. We describe its design and implementation. We demonstrate its functions on a sampled expression profile of the vorinostat-resistant HCT116 cell line. We compared jFuzzyMachine’s inferred regulatory network to that inferred by the ARACNe (an Algorithm for the Reconstruction of Gene Regulatory Networks) tool. Potentially more sensitive, jFuzzyMachine showed a slight increase in identified regulatory edges compared to ARACNe. A significant overlap was also observed in the identified edges between the two inference methods: over 70 percent of edges identified by ARACNe were identified by jFuzzyMachine. Beyond identifying edges, jFuzzyMachine shows the direction of interactions, including bidirectional interactions, specifying regulatory inputs and outputs of inferred relationships. jFuzzyMachine addresses an apparent lack of a freely available community tool implementing a fuzzy logic regulatory network inference method, mitigating a limitation to applying and extending the benefits of the fuzzy inference system to understanding biological data. jFuzzyMachine’s source code and precompiled binaries are freely available at the GitHub repository locations https://github.com/paiyetan/jfuzzymachine and https://github.com/paiyetan/jfuzzymachine/releases/tag/v1.7.21.
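The overlap figure quoted in the abstract (the share of one tool's edges that the other tool also reports) can be computed as a set intersection. The sketch below is a generic illustration and not part of jFuzzyMachine or ARACNe: the function name, the `(source, target)` edge tuples, and the toy gene labels are assumptions, and how the authors matched directed against undirected edges is not stated in the abstract.

```python
def edge_overlap(reference_edges, other_edges, directed=False):
    """Fraction of reference edges that also appear in other_edges.

    Edges are (source, target) pairs. With directed=False the pair is
    treated as unordered, so ("A", "B") matches ("B", "A").
    """
    def normalise(edge):
        return tuple(edge) if directed else tuple(sorted(edge))

    reference = {normalise(e) for e in reference_edges}
    other = {normalise(e) for e in other_edges}
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Toy networks with made-up gene labels.
network_a = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
network_b = [("B", "A"), ("B", "C"), ("C", "D"), ("E", "F"), ("A", "C")]

undirected_share = edge_overlap(network_a, network_b)
directed_share = edge_overlap(network_a, network_b, directed=True)
print(undirected_share, directed_share)
```

Ignoring direction gives the more generous number here, so the choice between the two conventions should be stated whenever an overlap percentage is reported.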


Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein-sequence-based replicon distribution scores

10.1101/2020.04.21.053082 ◽

2020 ◽

Cited By ~ 2

Author(s):

Oliver Schwengers ◽

Patrick Barth ◽

Linda Falgenhauer ◽

Torsten Hain ◽

Trinad Chakraborty ◽

...

Keyword(s):

High Throughput ◽

Protein Sequence ◽

Scientific Community ◽

Vital Role ◽

Bacterial Genomes ◽

Short Read ◽

Link Type ◽

Sequencing Technologies ◽

Generation Sequencing

Abstract: Plasmids are extrachromosomal genetic elements replicating independently of the chromosome which play a vital role in the environmental adaptation of bacteria. Due to potential mobilization or conjugation capabilities, plasmids are important genetic vehicles for antimicrobial resistance genes and virulence factors with huge and increasing clinical implications. They are therefore subject to large genomic studies within the scientific community worldwide. As a result of rapidly improving next generation sequencing methods, the number of sequenced bacterial genomes is constantly increasing, in turn raising the need for specialized tools to (i) extract plasmid sequences from draft assemblies, (ii) derive their origin and distribution, and (iii) further investigate their genetic repertoire. Recently, several bioinformatic methods and tools have emerged to tackle this issue; however, a combination of both high sensitivity and specificity in plasmid sequence identification is rarely achieved in a taxon-independent manner. In addition, many software tools are not appropriate for large high-throughput analyses or cannot be included in existing software pipelines due to their technical design or software implementation. In this study, we investigated differences in the replicon distributions of protein-coding genes on a large scale as a new approach to distinguish plasmid-borne from chromosome-borne contigs. We defined and computed statistical discrimination thresholds for a new metric, the replicon distribution score (RDS), which achieved an accuracy of 96.6%. The final performance was further improved by combining the RDS metric with heuristics exploiting several plasmid-specific higher-level contig characterizations. We implemented this workflow in a new high-throughput taxon-independent bioinformatics software tool called Platon for the recruitment and characterization of plasmid-borne contigs from short-read draft assemblies. Compared to PlasFlow, Platon achieved a higher accuracy (97.5%) and more balanced predictions (F1=82.6%) tested on a broad range of bacterial taxa, and better or equal performance against the targeted tools PlasmidFinder and PlaScope on sequenced E. coli isolates. Platon is available at: platon.computational.bio

Data Summary: Platon was developed as a Python 3 command line application for Linux. The complete source code and documentation are available on GitHub under a GPL3 license: https://github.com/oschwengers/platon and platon.computational.bio. All database versions are hosted at Zenodo: DOI 10.5281/zenodo.3349651. Platon is available via the bioconda package platon. Platon is available via the PyPI package cb-platon. Bacterial representative sequences for UniProt’s UniRef90 protein clusters, complete bacterial genome sequences from the NCBI RefSeq database, complete plasmid sequences from the NCBI genomes plasmid section, created artificial contigs, RDS threshold metrics and raw protein replicon hit counts used to create and evaluate the marker protein sequence database are hosted at Zenodo: DOI 10.5281/zenodo.3759169. 24 Escherichia coli isolates sequenced with short read (Illumina MiSeq) and long read sequencing technologies (Oxford Nanopore Technology GridION platform) used for real data benchmarks are available under the following NCBI BioProjects: PRJNA505407, PRJNA387731.

Impact Statement: Plasmids play a vital role in the spread of antibiotic resistance and pathogenicity genes. The increasing numbers of clinical outbreaks involving resistant pathogens worldwide have pushed the scientific community to increase its efforts to comprehensively investigate bacterial genomes. Due to the maturation of next-generation sequencing technologies, entire bacterial genomes including plasmids are now sequenced on a huge scale. To analyze draft assemblies, a mandatory first step is to separate plasmid from chromosome contigs. Recently, many bioinformatic tools have emerged to tackle this issue. Unfortunately, several tools are implemented only as interactive or web-based tools, which makes them unsuitable for the necessary high-throughput analysis of large data sets. Other tools that do provide a high-throughput implementation often come with certain drawbacks, e.g. providing taxon-specific databases only, not providing an actionable (i.e. true binary) classification, or achieving classification performance biased towards either sensitivity or specificity. Here, we introduce the tool Platon, which implements a new replicon distribution-based approach combined with higher-level contig characterizations to address the aforementioned issues. In addition to plasmid detection within draft assemblies, Platon provides the user with valuable information on certain higher-level contig characterizations. We show that Platon provides a balanced classification performance as well as a scalable implementation for high-throughput analyses. We therefore consider Platon to be a powerful, species-independent and flexible tool to scan large amounts of bacterial whole-genome sequencing data for their plasmid content.
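The abstract reports both accuracy and F1 for the plasmid/chromosome classification, and the two can diverge when plasmid contigs are a minority of all contigs. The following Python sketch shows how both are derived from a confusion matrix. The counts are invented for illustration and are not taken from the Platon benchmarks, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall (sensitivity) and F1 from
    confusion-matrix counts, treating 'plasmid' as the positive class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts: 95 true plasmid contigs among 1000 contigs in total.
metrics = classification_metrics(tp=80, fp=15, tn=890, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these toy counts the accuracy is 0.97 while F1 is about 0.84, because the many correctly classified chromosome contigs raise accuracy but do not enter the F1 calculation.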
