Curation of over 10 000 transcriptomic studies to enable data reuse

Database ◽  
2021 ◽  
Vol 2021 ◽  
Author(s):  
Nathaniel Lim ◽  
Stepan Tesar ◽  
Manuel Belmadani ◽  
Guillaume Poirier-Morency ◽  
Burak Ogan Mancarci ◽  
...  

Abstract Vast amounts of transcriptomic data reside in public repositories, but effective reuse remains challenging. Issues include unstructured dataset metadata, inconsistent data processing and quality control, and inconsistent probe–gene mappings across microarray technologies. Thus, extensive curation and data reprocessing are necessary prior to any reuse. The Gemma bioinformatics system was created to help address these issues. Gemma consists of a database of curated transcriptomic datasets, analytical software, a web interface and web services. Here we present an update on Gemma’s holdings, data processing and analysis pipelines, our curation guidelines, and software features. As of June 2020, Gemma contains 10 811 manually curated datasets (primarily human, mouse and rat), over 395 000 samples and hundreds of curated transcriptomic platforms (both microarray and RNA sequencing). Dataset topics were represented with 10 215 distinct terms from 12 ontologies, for a total of 54 316 topic annotations (mean topics/dataset = 5.2). While Gemma has broad coverage of conditions and tissues, it captures a large majority of available brain-related datasets, accounting for 34% of its holdings. Users can access the curated data and differential expression analyses through the Gemma website, RESTful service and an R package. Database URL: https://gemma.msl.ubc.ca/home.html
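
For readers who want to script against these access points, below is a minimal sketch of querying Gemma's RESTful service from R with httr and jsonlite. The /rest/v2/datasets endpoint, its query parameters and the shape of the JSON payload are assumptions drawn from the public API documentation rather than details stated in the abstract; the R package mentioned above offers a higher-level alternative.

```r
# Minimal sketch: search Gemma's REST service for datasets from R.
# The endpoint path and parameters below are assumptions based on the
# public API docs; consult Gemma's documentation for the authoritative form.
library(httr)
library(jsonlite)

base_url <- "https://gemma.msl.ubc.ca/rest/v2"

# Hypothetical search: datasets whose annotations mention "hippocampus".
resp <- GET(paste0(base_url, "/datasets"),
            query = list(query = "hippocampus", limit = 10))
stop_for_status(resp)

hits <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(hits$data, max.level = 1)  # a top-level "data" element is likewise assumed
```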

2019 ◽  
Vol 35 (20) ◽  
pp. 4190-4192 ◽  
Author(s):  
Vincenzo Belcastro ◽  
Stephane Cano ◽  
Diego Marescotti ◽  
Stefano Acali ◽  
Carine Poussin ◽  
...  

Abstract Summary The GladiaTOX R package is an open-source, flexible solution for high-content screening data processing and reporting in biomedical research. GladiaTOX takes advantage of the ‘tcpl’ core functionalities and provides a number of extensions: it offers a web-service solution to fetch raw data, computes severity scores, exports ToxPi-formatted files, and contains a suite of functionalities to generate PDF reports for quality control and data processing. Availability and implementation The GladiaTOX R package is available on Bioconductor, and also via: git clone https://github.com/philipmorrisintl/GladiaTOX.git. Supplementary information Supplementary data are available at Bioinformatics online.
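
A minimal getting-started sketch, assuming only the standard Bioconductor installation route; the numbered outline in the comments paraphrases the abstract and is not the package's literal API.

```r
# Install GladiaTOX from Bioconductor (standard BiocManager route).
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GladiaTOX")

library(GladiaTOX)
browseVignettes("GladiaTOX")  # the vignette documents the tcpl-style pipeline

# Hypothetical outline of the workflow the abstract describes:
# 1. fetch raw high-content screening data via the web-service connector,
# 2. process it with the tcpl-derived pipeline,
# 3. compute severity scores and export ToxPi-formatted files,
# 4. render PDF reports for quality control and data processing.
```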


2020 ◽  
Author(s):  
Maxime Meylan ◽  
Etienne Becht ◽  
Catherine Sautès-Fridman ◽  
Aurélien de Reyniès ◽  
Wolf H. Fridman ◽  
...  

Abstract Summary We previously reported MCP-counter and mMCP-counter, methods that allow precise estimation of the immune and stromal composition of human and murine samples from bulk transcriptomic data, but they were only distributed as R packages. Here, we report webMCP-counter, a user-friendly web interface that allows all users to apply these methods, regardless of their proficiency in the R programming language. Availability and implementation Freely available from http://134.157.229.105:3838/webMCP/. Website developed with the R package shiny. Source code available from GitHub: https://github.com/FPetitprez/webMCP-counter.
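
Since the web interface wraps the previously published MCP-counter R package, command-line users can run the estimation directly. The sketch below takes the install path (subdir = "Source") and the MCPcounter.estimate() signature from my reading of the project README; verify both against https://github.com/ebecht/MCPcounter before relying on them.

```r
# Minimal sketch of the underlying MCP-counter workflow (assumed API).
# install.packages("devtools")
devtools::install_github("ebecht/MCPcounter", ref = "master", subdir = "Source")

library(MCPcounter)

# 'expr' is a hypothetical genes-x-samples bulk expression matrix; real HUGO
# gene symbols are needed as row names for meaningful abundance scores.
expr <- matrix(rexp(200), nrow = 20,
               dimnames = list(paste0("GENE", 1:20), paste0("S", 1:10)))

scores <- MCPcounter.estimate(expr, featuresType = "HUGO_symbols")
head(scores)  # estimated immune/stromal cell-population abundances per sample
```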


Author(s):  
Fabricio Almeida-Silva ◽  
Kanhu C Moharana ◽  
Thiago M Venancio

Abstract In the past decade, over 3000 samples of soybean transcriptomic data have accumulated in public repositories. Here, we review the state of the art in soybean transcriptomics, highlighting the major microarray and RNA-seq studies that have investigated soybean transcriptional programs in different tissues and conditions. Further, we propose approaches for integrating such big data using gene coexpression networks, and we outline important web resources that may facilitate soybean data acquisition and analysis, contributing to the acceleration of soybean breeding and functional genomics research.
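
To make the proposed integration concrete, here is a minimal coexpression-network sketch using WGCNA, a common R choice for this task; WGCNA is my suggestion, not a package the review prescribes, and the expression matrix is a hypothetical placeholder standing in for reprocessed soybean samples.

```r
# Minimal WGCNA sketch: modules of coexpressed genes from a samples-x-genes
# matrix. 'datExpr' is hypothetical; in practice it would hold normalized
# expression values aggregated across public soybean studies.
library(WGCNA)

datExpr <- matrix(rnorm(100 * 500), nrow = 100,
                  dimnames = list(paste0("sample", 1:100),
                                  paste0("Glyma", 1:500)))

# Choose a soft-thresholding power approximating scale-free topology.
sft <- pickSoftThreshold(datExpr, powerVector = c(1:10, seq(12, 20, 2)))
power <- if (is.na(sft$powerEstimate)) 6 else sft$powerEstimate

# Build coexpression modules in one step.
net <- blockwiseModules(datExpr, power = power,
                        TOMType = "signed", minModuleSize = 30)
table(net$colors)  # module assignments (module sizes per colour label)
```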


2015 ◽  
Vol 14 (12) ◽  
pp. 5088-5098 ◽  
Author(s):  
Bas C. Jansen ◽  
Karli R. Reiding ◽  
Albert Bondt ◽  
Agnes L. Hipgrave Ederveen ◽  
Magnus Palmblad ◽  
...  

2020 ◽  
Author(s):  
Anna M. Sozanska ◽  
Charles Fletcher ◽  
Dóra Bihary ◽  
Shamith A. Samarajiwa

Abstract More than three decades ago, the microarray revolution brought high-throughput data generation capability to biology and medicine. Subsequently, the emergence of massively parallel sequencing technologies led to many big-data initiatives such as the Human Genome Project and the Encyclopedia of DNA Elements (ENCODE) project. These, in combination with cheaper, faster massively parallel DNA sequencing capabilities, have democratised multi-omic (genomic, transcriptomic, translatomic and epigenomic) data generation, leading to a data deluge in biomedicine. While some of these datasets are trapped in inaccessible silos, the vast majority are stored in public data resources and controlled-access data repositories, enabling their wider use (or misuse). Currently, most peer-reviewed publications require the dataset associated with a study to be deposited in one of these public data repositories. However, clunky, difficult-to-use interfaces and subpar or incomplete annotation prevent the discovery, searching and filtering of these multi-omic data and hinder their re-purposing in other use cases. The proliferation of a multitude of different data repositories, with partially redundant storage of similar data, is yet another obstacle to their continued usefulness. Similarly, interfaces where annotation is spread across multiple web pages, accession identifiers with ambiguous and multiple interpretations, and a lack of good curation make these datasets difficult to use. We have produced SpiderSeqR, an R package whose main features include integration between the NCBI GEO and SRA databases, enabling a unified search of SRA and GEO datasets and associated annotations, conversion between database accessions, convenient filtering of results and saving of past queries for future use. All of the above features aim to promote data reuse, facilitating new discoveries and maximising the potential of existing datasets. Availability https://github.com/ss-lab-cancerunit/SpiderSeqR
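
A minimal usage sketch follows. The function names (startSpiderSeqR, searchAnywhere, convertAccession) reflect my reading of the repository README rather than anything stated in the abstract, so treat them as assumptions to verify against the linked GitHub page.

```r
# Minimal SpiderSeqR sketch (assumed API; see the repository README).
# install.packages("devtools")
devtools::install_github("ss-lab-cancerunit/SpiderSeqR")

library(SpiderSeqR)

# Set up/locate the local SRA and GEO metadata databases.
startSpiderSeqR(path = getwd())

# Unified search across SRA and GEO annotations.
res <- searchAnywhere("ChIP-seq p53")

# Convert between database accessions (hypothetical GEO series accession,
# used purely for illustration).
convertAccession("GSE48557")
```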


1998 ◽  
Author(s):  
Martin J. Burgdorf ◽  
A. S. Harwood ◽  
N. R. Trams ◽  
Tanya L. Lim ◽  
Sunil D. Sidher ◽  
...  

2019 ◽  
Vol 36 (8) ◽  
pp. 2587-2588 ◽  
Author(s):  
Christopher M Ward ◽  
Thu-Hien To ◽  
Stephen M Pederson

Abstract Motivation High-throughput next-generation sequencing (NGS) has become exceedingly cheap, facilitating studies with large sample numbers. Quality control (QC) is an essential stage in analytic pipelines, and the outputs of popular bioinformatics tools such as FastQC and Picard can provide information on individual samples. Although these tools provide considerable power when carrying out QC, large sample numbers can make inspection of all samples and identification of systemic bias a challenge. Results We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import of FastQC reports into R, along with outputs from other tools. Visualization can be carried out across many samples using default, highly customizable plots, with options to perform hierarchical clustering to quickly identify outlier libraries. Moreover, these can be displayed in an interactive shiny app or an HTML report for ease of analysis. Availability and implementation The ngsReports package is available on Bioconductor and the GUI shiny app is available at https://github.com/UofABioinformaticsHub/shinyNgsreports. Supplementary information Supplementary data are available at Bioinformatics online.
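
A minimal import-and-plot sketch, assuming FastqcDataList(), plotSummary() and plotReadTotals() as the entry points; check the Bioconductor vignette for the current API. The file paths are hypothetical.

```r
# Minimal ngsReports sketch (assumed entry points; see the vignette).
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ngsReports")

library(ngsReports)

# Hypothetical directory of FastQC .zip reports.
fastqc_zips <- list.files("qc/fastqc", pattern = "zip$", full.names = TRUE)
fdl <- FastqcDataList(fastqc_zips)

plotSummary(fdl)     # PASS/WARN/FAIL grid across modules and samples
plotReadTotals(fdl)  # library sizes, handy for spotting outlier libraries
```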


2019 ◽  
Vol 489 (3) ◽  
pp. 2997-3002 ◽  
Author(s):  
Ivan P Dojčinović ◽  
Nora Trklja ◽  
Irinel Tapalaga ◽  
Jagoš Purić

Abstract We have investigated Stark line broadening within the spectral series of potassium-like and copper-like emitters, both separately and together. The analysis was performed for fixed values of electron density $N_e = 10^{22}$ m$^{-3}$ and temperature $T = 100\,000$ K. Algorithms made for fast data processing also serve for temperature and density normalization of the data. Relations obtained using the regularity-based analysis enable predictions of Stark widths for transitions that have not yet been calculated or measured. The results of the present investigation can be used for quality control of available Stark width data.
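
To make the regularity idea concrete: Purić-type analyses model Stark widths as a power law in a parameter of the emitting level, commonly the upper-level ionization potential, so a log-log linear fit yields the constants needed to predict unmeasured widths. The R sketch below illustrates that scheme with hypothetical numbers; the exact functional form and normalization should be taken from the paper itself.

```r
# Illustrative regularity fit: width w = a * chi^(-b), i.e.
# log(w) = log(a) - b * log(chi). All values below are hypothetical.
chi   <- c(2.1, 2.6, 3.0, 3.5, 4.1)       # upper-level ionization potentials (eV)
width <- c(0.95, 0.62, 0.48, 0.35, 0.26)  # Stark widths (arbitrary units)

fit <- lm(log(width) ~ log(chi))
coef(fit)  # intercept = log(a); slope = -b

# Predict a width for a transition not yet measured or calculated:
chi_new <- 3.8
exp(predict(fit, newdata = data.frame(chi = chi_new)))
```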

