Curation of over 10 000 transcriptomic studies to enable data reuse

Database ◽  
2021 ◽  
Vol 2021 ◽  
Author(s):  
Nathaniel Lim ◽  
Stepan Tesar ◽  
Manuel Belmadani ◽  
Guillaume Poirier-Morency ◽  
Burak Ogan Mancarci ◽  
...  

Abstract Vast amounts of transcriptomic data reside in public repositories, but effective reuse remains challenging. Issues include unstructured dataset metadata, inconsistent data processing and quality control, and inconsistent probe–gene mappings across microarray technologies. Thus, extensive curation and data reprocessing are necessary prior to any reuse. The Gemma bioinformatics system was created to help address these issues. Gemma consists of a database of curated transcriptomic datasets, analytical software, a web interface and web services. Here we present an update on Gemma’s holdings, data processing and analysis pipelines, our curation guidelines, and software features. As of June 2020, Gemma contains 10 811 manually curated datasets (primarily human, mouse and rat), over 395 000 samples and hundreds of curated transcriptomic platforms (both microarray and RNA sequencing). Dataset topics were represented with 10 215 distinct terms from 12 ontologies, for a total of 54 316 topic annotations (mean topics/dataset = 5.2). While Gemma has broad coverage of conditions and tissues, it captures a large majority of available brain-related datasets, accounting for 34% of its holdings. Users can access the curated data and differential expression analyses through the Gemma website, RESTful service and an R package. Database URL: https://gemma.msl.ubc.ca/home.html
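
For readers who want to script against these access points, below is a minimal sketch of querying Gemma's RESTful service from R with httr and jsonlite. The /rest/v2/datasets endpoint, its query parameters and the shape of the JSON payload are assumptions drawn from the public API documentation rather than details stated in the abstract; the R package mentioned above offers a higher-level alternative.

```r
# Minimal sketch: search Gemma's REST service for datasets from R.
# The endpoint path and parameters below are assumptions based on the
# public API docs; consult Gemma's documentation for the authoritative form.
library(httr)
library(jsonlite)

base_url <- "https://gemma.msl.ubc.ca/rest/v2"

# Hypothetical search: datasets whose annotations mention "hippocampus".
resp <- GET(paste0(base_url, "/datasets"),
            query = list(query = "hippocampus", limit = 10))
stop_for_status(resp)

hits <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(hits$data, max.level = 1)  # a top-level "data" element is likewise assumed
```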

2019 ◽  
Vol 35 (20) ◽  
pp. 4190-4192 ◽  
Author(s):  
Vincenzo Belcastro ◽  
Stephane Cano ◽  
Diego Marescotti ◽  
Stefano Acali ◽  
Carine Poussin ◽  
...  

Abstract Summary The GladiaTOX R package is an open-source, flexible solution for high-content screening data processing and reporting in biomedical research. GladiaTOX takes advantage of the ‘tcpl’ core functionalities and provides a number of extensions: it offers a web-service solution to fetch raw data, computes severity scores, exports ToxPi-formatted files, and contains a suite of functionalities to generate PDF reports for quality control and data processing. Availability and implementation The GladiaTOX R package is available on Bioconductor, and also via: git clone https://github.com/philipmorrisintl/GladiaTOX.git. Supplementary information Supplementary data are available at Bioinformatics online.
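
A minimal getting-started sketch, assuming only the standard Bioconductor installation route; the numbered outline in the comments paraphrases the abstract and is not the package's literal API.

```r
# Install GladiaTOX from Bioconductor (standard BiocManager route).
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GladiaTOX")

library(GladiaTOX)
browseVignettes("GladiaTOX")  # the vignette documents the tcpl-style pipeline

# Hypothetical outline of the workflow the abstract describes:
# 1. fetch raw high-content screening data via the web-service connector,
# 2. process it with the tcpl-derived pipeline,
# 3. compute severity scores and export ToxPi-formatted files,
# 4. render PDF reports for quality control and data processing.
```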


2020 ◽  
Author(s):  
Maxime Meylan ◽  
Etienne Becht ◽  
Catherine Sautès-Fridman ◽  
Aurélien de Reyniès ◽  
Wolf H. Fridman ◽  
...  

Abstract Summary We previously reported MCP-counter and mMCP-counter, methods that allow precise estimation of the immune and stromal composition of human and murine samples from bulk transcriptomic data, but they were only distributed as R packages. Here, we report webMCP-counter, a user-friendly web interface that allows all users to apply these methods, regardless of their proficiency in the R programming language. Availability and implementation Freely available from http://134.157.229.105:3838/webMCP/. Website developed with the R package shiny. Source code available from GitHub: https://github.com/FPetitprez/webMCP-counter.
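
Since the web interface wraps the previously published MCP-counter R package, command-line users can run the estimation directly. The sketch below takes the install path (subdir = "Source") and the MCPcounter.estimate() signature from my reading of the project README; verify both against https://github.com/ebecht/MCPcounter before relying on them.

```r
# Minimal sketch of the underlying MCP-counter workflow (assumed API).
# install.packages("devtools")
devtools::install_github("ebecht/MCPcounter", ref = "master", subdir = "Source")

library(MCPcounter)

# 'expr' is a hypothetical genes-x-samples bulk expression matrix; real HUGO
# gene symbols are needed as row names for meaningful abundance scores.
expr <- matrix(rexp(200), nrow = 20,
               dimnames = list(paste0("GENE", 1:20), paste0("S", 1:10)))

scores <- MCPcounter.estimate(expr, featuresType = "HUGO_symbols")
head(scores)  # estimated immune/stromal cell-population abundances per sample
```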


Author(s):  
Fabricio Almeida-Silva ◽  
Kanhu C Moharana ◽  
Thiago M Venancio

Abstract In the past decade, over 3000 samples of soybean transcriptomic data have accumulated in public repositories. Here, we review the state of the art in soybean transcriptomics, highlighting the major microarray and RNA-seq studies that have investigated soybean transcriptional programs in different tissues and conditions. Further, we propose approaches for integrating such big data using gene coexpression networks, and we outline important web resources that may facilitate soybean data acquisition and analysis, contributing to the acceleration of soybean breeding and functional genomics research.
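
To make the proposed integration concrete, here is a minimal coexpression-network sketch using WGCNA, a common R choice for this task; WGCNA is my suggestion, not a package the review prescribes, and the expression matrix is a hypothetical placeholder standing in for reprocessed soybean samples.

```r
# Minimal WGCNA sketch: modules of coexpressed genes from a samples-x-genes
# matrix. 'datExpr' is hypothetical; in practice it would hold normalized
# expression values aggregated across public soybean studies.
library(WGCNA)

datExpr <- matrix(rnorm(100 * 500), nrow = 100,
                  dimnames = list(paste0("sample", 1:100),
                                  paste0("Glyma", 1:500)))

# Choose a soft-thresholding power approximating scale-free topology.
sft <- pickSoftThreshold(datExpr, powerVector = c(1:10, seq(12, 20, 2)))
power <- if (is.na(sft$powerEstimate)) 6 else sft$powerEstimate

# Build coexpression modules in one step.
net <- blockwiseModules(datExpr, power = power,
                        TOMType = "signed", minModuleSize = 30)
table(net$colors)  # module assignments (module sizes per colour label)
```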


2015 ◽  
Vol 14 (12) ◽  
pp. 5088-5098 ◽  
Author(s):  
Bas C. Jansen ◽  
Karli R. Reiding ◽  
Albert Bondt ◽  
Agnes L. Hipgrave Ederveen ◽  
Magnus Palmblad ◽  
...  

2020 ◽  
Author(s):  
Anna M. Sozanska ◽  
Charles Fletcher ◽  
Dóra Bihary ◽  
Shamith A. Samarajiwa

Abstract More than three decades ago, the microarray revolution brought high-throughput data generation capability to biology and medicine. Subsequently, the emergence of massively parallel sequencing technologies led to many big-data initiatives such as the Human Genome Project and the Encyclopedia of DNA Elements (ENCODE) project. These, in combination with cheaper, faster massively parallel DNA sequencing capabilities, have democratised multi-omic (genomic, transcriptomic, translatomic and epigenomic) data generation, leading to a data deluge in biomedicine. While some of these datasets are trapped in inaccessible silos, the vast majority are stored in public data resources and controlled-access data repositories, enabling their wider use (or misuse). Currently, most peer-reviewed publications require the dataset associated with a study to be deposited in one of these public data repositories. However, clunky, difficult-to-use interfaces and subpar or incomplete annotation prevent the discovery, searching and filtering of these multi-omic data and hinder their re-purposing in other use cases. The proliferation of a multitude of different data repositories, with partially redundant storage of similar data, is yet another obstacle to their continued usefulness. Similarly, interfaces where annotation is spread across multiple web pages, accession identifiers with ambiguous and multiple interpretations, and a lack of good curation make these datasets difficult to use. We have produced SpiderSeqR, an R package whose main features include integration between the NCBI GEO and SRA databases, enabling a unified search of SRA and GEO datasets and associated annotations, conversion between database accessions, convenient filtering of results and saving of past queries for future use. All of the above features aim to promote data reuse, facilitating new discoveries and maximising the potential of existing datasets. Availability https://github.com/ss-lab-cancerunit/SpiderSeqR
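
A minimal usage sketch follows. The function names (startSpiderSeqR, searchAnywhere, convertAccession) reflect my reading of the repository README rather than anything stated in the abstract, so treat them as assumptions to verify against the linked GitHub page.

```r
# Minimal SpiderSeqR sketch (assumed API; see the repository README).
# install.packages("devtools")
devtools::install_github("ss-lab-cancerunit/SpiderSeqR")

library(SpiderSeqR)

# Set up/locate the local SRA and GEO metadata databases.
startSpiderSeqR(path = getwd())

# Unified search across SRA and GEO annotations.
res <- searchAnywhere("ChIP-seq p53")

# Convert between database accessions (hypothetical GEO series accession,
# used purely for illustration).
convertAccession("GSE48557")
```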


1998 ◽  
Author(s):  
Martin J. Burgdorf ◽  
A. S. Harwood ◽  
N. R. Trams ◽  
Tanya L. Lim ◽  
Sunil D. Sidher ◽  
...  

2019 ◽  
Vol 36 (8) ◽  
pp. 2587-2588 ◽  
Author(s):  
Christopher M Ward ◽  
Thu-Hien To ◽  
Stephen M Pederson

Abstract Motivation High-throughput next-generation sequencing (NGS) has become exceedingly cheap, facilitating studies with large sample numbers. Quality control (QC) is an essential stage in analytic pipelines, and the outputs of popular bioinformatics tools such as FastQC and Picard can provide information on individual samples. Although these tools provide considerable power when carrying out QC, large sample numbers can make inspection of all samples and identification of systemic bias a challenge. Results We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import of FastQC reports into R, along with outputs from other tools. Visualization can be carried out across many samples using default, highly customizable plots, with options to perform hierarchical clustering to quickly identify outlier libraries. Moreover, these can be displayed in an interactive shiny app or an HTML report for ease of analysis. Availability and implementation The ngsReports package is available on Bioconductor and the GUI shiny app is available at https://github.com/UofABioinformaticsHub/shinyNgsreports. Supplementary information Supplementary data are available at Bioinformatics online.
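
A minimal import-and-plot sketch, assuming FastqcDataList(), plotSummary() and plotReadTotals() as the entry points; check the Bioconductor vignette for the current API. The file paths are hypothetical.

```r
# Minimal ngsReports sketch (assumed entry points; see the vignette).
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ngsReports")

library(ngsReports)

# Hypothetical directory of FastQC .zip reports.
fastqc_zips <- list.files("qc/fastqc", pattern = "zip$", full.names = TRUE)
fdl <- FastqcDataList(fastqc_zips)

plotSummary(fdl)     # PASS/WARN/FAIL grid across modules and samples
plotReadTotals(fdl)  # library sizes, handy for spotting outlier libraries
```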


2019 ◽  
Vol 489 (3) ◽  
pp. 2997-3002 ◽  
Author(s):  
Ivan P Dojčinović ◽  
Nora Trklja ◽  
Irinel Tapalaga ◽  
Jagoš Purić

Abstract We have investigated Stark line broadening within the spectral series of potassium-like and copper-like emitters, both separately and together. The analysis was performed for fixed values of electron density $N_e = 10^{22}$ m$^{-3}$ and temperature $T = 100\,000$ K. Algorithms made for fast data processing also serve for temperature and density normalization of the data. Relations obtained using the regularity-based analysis enable predictions of Stark widths for transitions that have not yet been calculated or measured. The results of the present investigation can be used for quality control of available Stark width data.
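
To make the regularity idea concrete: Purić-type analyses model Stark widths as a power law in a parameter of the emitting level, commonly the upper-level ionization potential, so a log-log linear fit yields the constants needed to predict unmeasured widths. The R sketch below illustrates that scheme with hypothetical numbers; the exact functional form and normalization should be taken from the paper itself.

```r
# Illustrative regularity fit: width w = a * chi^(-b), i.e.
# log(w) = log(a) - b * log(chi). All values below are hypothetical.
chi   <- c(2.1, 2.6, 3.0, 3.5, 4.1)       # upper-level ionization potentials (eV)
width <- c(0.95, 0.62, 0.48, 0.35, 0.26)  # Stark widths (arbitrary units)

fit <- lm(log(width) ~ log(chi))
coef(fit)  # intercept = log(a); slope = -b

# Predict a width for a transition not yet measured or calculated:
chi_new <- 3.8
exp(predict(fit, newdata = data.frame(chi = chi_new)))
```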

