Performing Highly Efficient Genome Scans for Local Adaptation with R Package pcadapt Version 4

Florian Privé; Keurcien Luu; Bjarni J Vilhjálmsson; Michael G B Blum

doi:10.1093/molbev/msaa053

Molecular Biology and Evolution ◽

10.1093/molbev/msaa053 ◽

2020 ◽

Vol 37 (7) ◽

pp. 2153-2154 ◽

Cited By ~ 2

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Bjarni J Vilhjálmsson ◽

Michael G B Blum

Keyword(s):

Local Adaptation ◽

Principal Components ◽

Computational Efficiency ◽

Computation Time ◽

R Package ◽

Large Reduction ◽

Genome Scans ◽

Highly Efficient ◽

User Friendly

Abstract. The R package pcadapt is a user-friendly tool for performing genome scans for local adaptation. Here, we present version 4 of pcadapt, which substantially improves computational efficiency while providing similar results. This improvement is made possible by a different format for storing genotypes and a different algorithm for computing principal components of the genotype matrix, which is the most computationally demanding step of the pcadapt method. These changes are seamlessly integrated into the existing pcadapt package, and users will experience a large reduction in computation time (by a factor of 20–60 in our analyses) compared with previous versions.

pcaExplorer: an R/Bioconductor package for interacting with RNA-seq principal components

10.1101/493551 ◽

2018 ◽

Cited By ~ 1

Author(s):

Federico Marini ◽

Harald Binder

Keyword(s):

Principal Components ◽

Principal Component ◽

R Package ◽

Expression Data ◽

Rna Seq ◽

Functional Interpretation ◽

Software Packages ◽

Bioconductor Project ◽

Interactive Data ◽

User Friendly

Abstract. Background: Principal component analysis (PCA) is frequently used in genomics applications for quality assessment and exploratory analysis in high-dimensional data, such as RNA sequencing (RNA-seq) gene expression assays. Despite the availability of many software packages developed for this purpose, an interactive and comprehensive interface for performing these operations is lacking. Results: We developed the pcaExplorer software package to enhance commonly performed analysis steps with an interactive and user-friendly application, which provides state saving as well as the automated creation of reproducible reports. pcaExplorer is implemented in R using the Shiny framework and exploits data structures from the open-source Bioconductor project. Users can easily generate a wide variety of publication-ready graphs, while assessing the expression data in the different modules available, including a general overview, dimension reduction on samples and genes, as well as functional interpretation of the principal components. Conclusion: pcaExplorer is distributed as an R package in the Bioconductor project (http://bioconductor.org/packages/pcaExplorer/), and is designed to assist a broad range of researchers in the critical step of interactive data exploration.

DIscBIO: A User-Friendly Pipeline for Biomarker Discovery in Single-Cell Transcriptomics

International Journal of Molecular Sciences ◽

10.3390/ijms22031399 ◽

2021 ◽

Vol 22 (3) ◽

pp. 1399

Author(s):

Salim Ghannoum ◽

Waldir Leoncio Netto ◽

Damiano Fantini ◽

Benjamin Ragan-Kelley ◽

Amirabbas Parizadeh ◽

...

Keyword(s):

Single Cell ◽

Biomarker Discovery ◽

Enrichment Analysis ◽

Myxoid Liposarcoma ◽

R Package ◽

Differential Analysis ◽

A Cell ◽

Reproducible Analysis ◽

Transcriptomic Level ◽

User Friendly

The growing attention toward the benefits of single-cell RNA sequencing (scRNA-seq) is leading to a myriad of computational packages for the analysis of different aspects of scRNA-seq data. For researchers without advanced programming skills, it is very challenging to combine several packages in order to perform the desired analysis in a simple and reproducible way. Here we present DIscBIO, an open-source, multi-algorithmic pipeline for easy, efficient and reproducible analysis of cellular sub-populations at the transcriptomic level. The pipeline integrates multiple scRNA-seq packages and allows biomarker discovery with decision trees and gene enrichment analysis in a network context using single-cell sequencing read counts through clustering and differential analysis. DIscBIO is freely available as an R package. It can be run either in command-line mode or through a user-friendly computational pipeline using Jupyter notebooks. We showcase all pipeline features using two scRNA-seq datasets. The first dataset consists of circulating tumor cells from patients with breast cancer. The second one is a cell cycle regulation dataset in myxoid liposarcoma. All analyses are available as notebooks that integrate R code with explanatory text, output data and images in a sequential narrative. R users can use the notebooks to understand the different steps of the pipeline and as a guide for exploring their own scRNA-seq data. We also provide a cloud version using Binder that allows the execution of the pipeline without the need to download R, Jupyter or any of the packages used by the pipeline. The cloud version can serve as a tutorial for training purposes, especially for those who are not R users or have limited programming skills. However, in order to do meaningful scRNA-seq analyses, all users will need to understand the implemented methods and their possible options and limitations.

BREC: an R package/Shiny app for automatically identifying heterochromatin boundaries and estimating local recombination rates along chromosomes

BMC Bioinformatics ◽

10.1186/s12859-021-04233-1 ◽

2021 ◽

Vol 22 (S6) ◽

Author(s):

Yasmine Mansour ◽

Annie Chateau ◽

Anna-Sophie Fiston-Lavier

Keyword(s):

Data Quality ◽

Data Science ◽

Fruit Fly ◽

R Package ◽

Model Organisms ◽

Data Quality Control ◽

Recombination Rates ◽

Functional Dynamics ◽

Shiny App ◽

User Friendly

Abstract. Background: Meiotic recombination is a vital biological process that plays an essential role in the genome's structural and functional dynamics. Genomes exhibit highly variable recombination profiles along chromosomes, associated with several chromatin states. However, eu-heterochromatin boundaries are neither available nor easily obtained for non-model organisms, especially newly sequenced ones. Hence, we lack the accurate local recombination rates needed to address evolutionary questions. Results: Here, we propose an automated computational tool, based on the Marey map method, that identifies heterochromatin boundaries along chromosomes and estimates local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates), is non-genome-specific and runs even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles issues with the density and distribution of markers. Conclusions: BREC's heterochromatin boundaries have been validated against cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent values. BREC's recombination rates have also been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC as an R package and a user-friendly Shiny web application, yielding a fast, easy-to-use, and broadly accessible resource. The BREC R package is available at the GitHub repository https://github.com/GenomeStructureOrganization.

Metastatic heterogeneity of the consensus molecular subtypes of colorectal cancer

npj Genomic Medicine ◽

10.1038/s41525-021-00223-7 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Peter W. Eide ◽

Seyed H. Moosavi ◽

Ina A. Eilertsen ◽

Tuva H. Brunsell ◽

Jonas Langerud ◽

...

Keyword(s):

Gene Expression ◽

Colorectal Cancer ◽

Principal Components ◽

Prognostic Value ◽

Tumor Heterogeneity ◽

Molecular Subtypes ◽

R Package ◽

Data Set ◽

Primary Tumors ◽

External Data

Abstract. Gene expression-based subtypes of colorectal cancer have clinical relevance, but the representativeness of primary tumors and the consensus molecular subtypes (CMS) for metastatic cancers is not well known. We investigated the metastatic heterogeneity of CMS. The best approach to subtype translation was delineated by comparisons of transcriptomic profiles from 317 primary tumors and 295 liver metastases, including multi-metastatic samples from 45 patients and 14 primary-metastasis sets. Associations were validated in an external data set (n = 618). Projection of metastases onto principal components of primary tumors showed that metastases were depleted of CMS1-immune/CMS3-metabolic signals, enriched for CMS4-mesenchymal/stromal signals, and heavily influenced by the microenvironment. The tailored CMS classifier (available in an updated version of the R package CMScaller) therefore implemented an approach to regress out the liver tissue background. The majority of classified metastases were either CMS2 or CMS4. Nonetheless, subtype switching and inter-metastatic CMS heterogeneity were frequent and increased with sampling intensity. Poor-prognostic value of CMS1/3 metastases was consistent in the context of intra-patient tumor heterogeneity.

SNP-level FST outperforms window statistics for detecting soft sweeps in local adaptation

10.1101/2022.01.12.476036 ◽

2022 ◽

Author(s):

Tiago da Silva Ribeiro ◽

José A Galván ◽

John E Pool

Keyword(s):

Genetic Differentiation ◽

Local Adaptation ◽

Genetic Variant ◽

Real Data ◽

Selective Sweeps ◽

Functional Categories ◽

Genome Scans ◽

Genome Wide ◽

A Genome ◽

Local Selection

Local adaptation can lead to elevated genetic differentiation at the targeted genetic variant and nearby sites. Selective sweeps come in different forms, and depending on the initial and final frequencies of a favored variant, very different patterns of genetic variation may be produced. If local selection favors an existing variant that had already recombined onto multiple genetic backgrounds, then the width of elevated genetic differentiation (high FST) may be too narrow to detect using a typical windowed genome scan, even if the targeted variant becomes highly differentiated. We therefore used a simulation approach to investigate the power of SNP-level FST (specifically, the maximum SNP FST value within a window) to detect diverse scenarios of local adaptation, and compared it against whole-window FST and the Comparative Haplotype Identity statistic. We found that SNP FST had superior power to detect complete or mostly complete soft sweeps, but lesser power than window-wide statistics to detect partial hard sweeps. To investigate the relative enrichment and nature of SNP FST outliers from real data, we applied the two FST statistics to a panel of Drosophila melanogaster populations. We found that SNP FST had a genome-wide enrichment of outliers compared to demographic expectations, and though it yielded a lesser enrichment than window FST, it detected mostly unique outlier genes and functional categories. Our results suggest that SNP FST is highly complementary to typical window-based approaches for detecting local adaptation, and merits inclusion in future genome scans and methodologies.
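The contrast between the two FST statistics in this abstract can be illustrated with a small sketch. The abstract does not say which FST estimator was used, so the code below assumes a simple Hudson-style estimator on population allele frequencies without sample-size correction. The function names are hypothetical and are not taken from the authors' software.

```python
from typing import Sequence, Tuple


def fst_terms(p1: float, p2: float) -> Tuple[float, float]:
    """Return (numerator, denominator) of a per-SNP FST.

    Assumption: Hudson-style estimator without sample-size correction,
    FST = (p1 - p2)^2 / (p1 * (1 - p2) + p2 * (1 - p1)).
    """
    num = (p1 - p2) ** 2
    den = p1 * (1.0 - p2) + p2 * (1.0 - p1)
    return num, den


def window_fst(freqs1: Sequence[float], freqs2: Sequence[float]) -> float:
    """Whole-window FST: ratio of summed numerators to summed denominators."""
    terms = [fst_terms(a, b) for a, b in zip(freqs1, freqs2)]
    total_den = sum(den for _, den in terms)
    if total_den == 0.0:
        return 0.0  # no SNP in the window is variable in either population
    return sum(num for num, _ in terms) / total_den


def max_snp_fst(freqs1: Sequence[float], freqs2: Sequence[float]) -> float:
    """SNP-level statistic: the largest per-SNP FST within the window."""
    best = 0.0
    for a, b in zip(freqs1, freqs2):
        num, den = fst_terms(a, b)
        if den > 0.0:
            best = max(best, num / den)
    return best


if __name__ == "__main__":
    # One strongly differentiated SNP among otherwise undifferentiated sites.
    pop1 = [0.1, 0.5, 0.9]
    pop2 = [0.1, 0.5, 0.1]
    print("window FST: ", window_fst(pop1, pop2))
    print("max SNP FST:", max_snp_fst(pop1, pop2))
```

In this toy window a single differentiated SNP is diluted by its undifferentiated neighbors in the whole-window value, while the maximum SNP value is unaffected by them. This is the narrow-signal situation the abstract describes for soft sweeps.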

Introducing bdclean: a user friendly biodiversity data cleaning pipeline

Biodiversity Information Science and Standards ◽

10.3897/biss.2.25564 ◽

2018 ◽

Vol 2 ◽

pp. e25564

Author(s):

Tomer Gueta ◽

Vijay Barve ◽

Thiloshon Nagarajah ◽

Ashwin Agrawal ◽

Yohay Carmel

Keyword(s):

Data Cleaning ◽

Control Process ◽

R Package ◽

Modular Approach ◽

Data Validation ◽

Biodiversity Data ◽

Quality Control Process ◽

Cleaning Procedures ◽

R Packages ◽

User Friendly

A new R package for biodiversity data cleaning, 'bdclean', was initiated in the Google Summer of Code (GSoC) 2017 and is available on GitHub. Several R packages have great data validation and cleaning functions, but 'bdclean' provides features to manage a complete pipeline for biodiversity data cleaning, from data quality exploration to cleaning procedures and reporting. Users are able to go through the quality control process in a very structured, intuitive, and effective way. A modular approach to data cleaning functionality should make this package extensible for many biodiversity data cleaning needs. Under GSoC 2018, 'bdclean' will go through a comprehensive upgrade. New features will be highlighted in the demonstration.

FIREcaller: Detecting Frequently Interacting Regions from Hi-C Data

10.1101/619288 ◽

2019 ◽

Cited By ~ 3

Author(s):

Cheynna Crowley ◽

Yuchen Yang ◽

Yunjiang Qiu ◽

Benxia Hu ◽

Armen Abnousi ◽

...

Keyword(s):

Gene Regulation ◽

Spatial Organization ◽

R Package ◽

Specific Gene ◽

List Type ◽

Cell Type ◽

R Software ◽

Computational Tools ◽

Cell Type Specific ◽

User Friendly

Abstract. Hi-C experiments have been widely adopted to study chromatin spatial organization, which plays an essential role in genome function. We have recently identified frequently interacting regions (FIREs) and found that they are closely associated with cell-type-specific gene regulation. However, computational tools for detecting FIREs from Hi-C data are still lacking. In this work, we present FIREcaller, a stand-alone, user-friendly R package for detecting FIREs from Hi-C data. FIREcaller takes raw Hi-C contact matrices as input, performs within-sample and cross-sample normalization, and outputs continuous FIRE scores, dichotomous FIREs, and super-FIREs. Applying FIREcaller to Hi-C data from various human tissues, we demonstrate that FIREs and super-FIREs identified in a tissue-specific manner are closely related to gene regulation, are enriched for enhancer-promoter (E-P) interactions, tend to overlap with regions exhibiting epigenomic signatures of cis-regulatory roles, and aid the interpretation of GWAS variants. The FIREcaller package is implemented in R and freely available at https://yunliweb.its.unc.edu/FIREcaller.

Highlights:

– Frequently Interacting Regions (FIREs) can be used to identify tissue- and cell-type-specific cis-regulatory regions.

– An R software package, FIREcaller, has been developed to identify FIREs and cluster them into super-FIREs.

Improved accuracy and computational efficiency in virtual acoustic rendering using principal components-based amplitude panning

The Journal of the Acoustical Society of America ◽

10.1121/10.0008350 ◽

2021 ◽

Vol 150 (4) ◽

pp. A299-A299

Author(s):

Matthew Neal ◽

Pavel Zahorik

Keyword(s):

Principal Components ◽

Computational Efficiency ◽

Improved Accuracy

Joint DNA-based Disaster Victim Identification

10.21203/rs.3.rs-296414/v1 ◽

2021 ◽

Author(s):

Magnus Dehli Vigeland ◽

Thore Egeland

Keyword(s):

A Priori ◽

Search Space ◽

R Package ◽

Individual Identification ◽

Disaster Victim Identification ◽

Data Set ◽

Victim Identification ◽

Joint Identification ◽

Available Information ◽

User Friendly

Abstract. We address computational and statistical aspects of DNA-based identification of victims in the aftermath of disasters. Current methods and software for such identification typically consider each victim individually, leading to suboptimal power of identification and potential inconsistencies in the statistical summary of the evidence. We resolve these problems by performing joint identification of all victims, using the complete genetic data set. Individual identification probabilities, conditional on all available information, are derived from the joint solution in the form of posterior pairing probabilities. A closed formula is obtained for the a priori number of possible joint solutions to a given DVI problem. This number increases quickly with the number of victims and missing persons, posing computational challenges for brute force approaches. We address this complexity with a preparatory sequential step aiming to reduce the search space. The examples show that realistic cases are handled efficiently. User-friendly implementations of all methods are provided in the R package dvir, freely available on all platforms.
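The abstract does not state the closed formula, so the following is only a rough illustration of why the number of joint solutions grows quickly. It counts assignments in a simplified setting that is an assumption of this sketch: each victim is either left unidentified or paired with a distinct missing person, and sex or any other constraint is ignored. The function name is hypothetical and is not part of the dvir package.

```python
from math import comb, factorial


def count_joint_assignments(n_victims: int, n_missing: int) -> int:
    """Count partial one-to-one pairings of victims with missing persons.

    Simplifying assumption (not from the abstract): any victim may be paired
    with any missing person, and each may appear in at most one pair.
    For k pairs: choose k victims, choose k missing persons, and match
    them in k! ways; then sum over all feasible k.
    """
    if n_victims < 0 or n_missing < 0:
        raise ValueError("counts must be non-negative")
    return sum(
        comb(n_victims, k) * comb(n_missing, k) * factorial(k)
        for k in range(min(n_victims, n_missing) + 1)
    )


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10):
        print(n, "victims,", n, "missing persons:", count_joint_assignments(n, n))
```

Even in this simplified setting the count grows much faster than the number of individuals, which is consistent with the abstract's remark that brute force enumeration becomes costly and that reducing the search space first is worthwhile.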

DiscoRhythm: an easy-to-use web application and R package for discovering rhythmicity

Bioinformatics ◽

10.1093/bioinformatics/btz834 ◽

2019 ◽

Cited By ~ 2

Author(s):

Matthew Carlucci ◽

Algimantas Kriščiūnas ◽

Haohan Li ◽

Povilas Gibas ◽

Karolis Koncevičius ◽

...

Keyword(s):

Web Application ◽

Statistical Significance ◽

R Package ◽

Biological Data ◽

Supplementary Information ◽

Statistical Knowledge ◽

Health And Disease ◽

Phase Amplitude ◽

Almost All ◽

User Friendly

Abstract. Motivation: Biological rhythmicity is fundamental to almost all organisms on Earth and plays a key role in health and disease. Identification of oscillating signals could lead to novel biological insights, yet its investigation is impeded by the extensive computational and statistical knowledge required to perform such analysis. Results: To address this issue, we present DiscoRhythm (Discovering Rhythmicity), a user-friendly application for characterizing rhythmicity in temporal biological data. DiscoRhythm is available as a web application or an R/Bioconductor package for estimating phase, amplitude, and statistical significance using four popular approaches to rhythm detection (Cosinor, JTK Cycle, ARSER, and Lomb-Scargle). We optimized these algorithms for speed, improving their execution times up to 30-fold to enable rapid analysis of -omic-scale datasets in real time. Informative visualizations, interactive modules for quality control, dimensionality reduction, periodicity profiling, and incorporation of experimental replicates make DiscoRhythm a thorough toolkit for analyzing rhythmicity. Availability and Implementation: The DiscoRhythm R package is available on Bioconductor (https://bioconductor.org/packages/DiscoRhythm), with source code available on GitHub (https://github.com/matthewcarlucci/DiscoRhythm) under a GPL-3 license. The web application is securely deployed over HTTPS (https://disco.camh.ca) and is freely available for use worldwide. Local instances of the DiscoRhythm web application can be created using the R package or by deploying the publicly available Docker container (https://hub.docker.com/r/mcarlucci/discorhythm). Supplementary information: Supplementary data are available at Bioinformatics online.
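Of the four approaches named in this abstract, Cosinor is the simplest to illustrate. The sketch below is a generic single-component cosinor fit by ordinary least squares with a known period. It is not DiscoRhythm's implementation, it does not compute statistical significance, and the function names are hypothetical. The model is y = M + A cos(ωt − φ) with ω = 2π/period, linearized as y = M + β cos(ωt) + γ sin(ωt).

```python
import math
from typing import List, Sequence, Tuple


def solve_3x3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design: need more distinct time points")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def cosinor_fit(
    times: Sequence[float], values: Sequence[float], period: float
) -> Tuple[float, float, float]:
    """Return (mesor, amplitude, peak_time) of a least-squares cosinor fit."""
    omega = 2.0 * math.pi / period
    rows = [(1.0, math.cos(omega * t), math.sin(omega * t)) for t in times]
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = solve_3x3(xtx, xty)
    amplitude = math.hypot(beta, gamma)   # A = sqrt(beta^2 + gamma^2)
    phase = math.atan2(gamma, beta)       # phi, since beta = A cos(phi), gamma = A sin(phi)
    peak_time = (phase / omega) % period  # time at which the fitted curve peaks
    return mesor, amplitude, peak_time


if __name__ == "__main__":
    hours = list(range(24))
    signal = [10.0 + 3.0 * math.cos(2.0 * math.pi * (t - 6.0) / 24.0) for t in hours]
    print(cosinor_fit(hours, signal, period=24.0))
```

Because the period is fixed in advance, the fit reduces to a three-parameter linear regression, which is why it is solved here with plain normal equations rather than an iterative optimizer.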
