Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering

2015 ◽  
Vol 14 ◽  
pp. CIN.S33076 ◽  
Author(s):  
Kevin K. Mcdade ◽  
Uma Chandran ◽  
Roger S. Day

Data quality is a recognized problem for high-throughput genomics platforms, as evinced by the proliferation of methods attempting to filter out lower quality data points. Different filtering methods lead to discordant results, raising the question of which methods are best. Astonishingly, little computational support is offered to analysts to decide which filtering methods are optimal for the research question at hand. To evaluate them, we begin with a pair of expression data sets, transcriptomic and proteomic, on the same samples. The pair of data sets forms a test-bed for the evaluation. Identifier mapping between the data sets creates a collection of feature pairs, with correlations calculated for each pair. To evaluate a filtering strategy, we estimate posterior probabilities for the correctness of probesets accepted by the method. An analyst can set expected utilities that represent the trade-off between the quality and quantity of accepted features. We tested nine published probeset filtering methods and combination strategies. We used two test-beds from cancer studies providing transcriptomic and proteomic data. For reasonable utility settings, the Jetset filtering method was optimal for probeset filtering on both test-beds, even though the two test-beds used different assay platforms. Further intersection with a second filtering method was indicated on one test-bed but not the other.
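For readers who want to experiment with the test-bed idea, the following is a minimal sketch (not the authors' code) of correlating mapped transcript-protein feature pairs and scoring a filter by a quality/quantity utility; the data frames, column names, correlation measure and thresholds are illustrative assumptions.

```python
# Sketch of the evaluation idea: correlate mapped transcript/protein feature
# pairs across shared samples, then score a filtering method by trading off
# quality (accepted pairs that correlate well across platforms) against
# quantity (number of pairs accepted). All names and cutoffs are assumptions.
import pandas as pd

def pair_correlations(mrna: pd.DataFrame, protein: pd.DataFrame,
                      id_map: pd.DataFrame) -> pd.Series:
    """Spearman correlation for each mapped (probeset, protein) pair.

    mrna, protein: features x samples matrices sharing sample columns.
    id_map: two columns, 'probeset' and 'protein_id', in that order.
    """
    shared = mrna.columns.intersection(protein.columns)
    corrs = {}
    for probeset, prot in id_map.itertuples(index=False):
        if probeset in mrna.index and prot in protein.index:
            x = mrna.loc[probeset, shared].astype(float)
            y = protein.loc[prot, shared].astype(float)
            corrs[probeset] = x.corr(y, method="spearman")
    return pd.Series(corrs, name="correlation")

def expected_utility(corrs: pd.Series, accepted: set,
                     u_good: float = 1.0, u_bad: float = -2.0,
                     good_cutoff: float = 0.3) -> float:
    """Utility of a filter: reward accepted pairs whose cross-platform
    correlation clears the cutoff, penalise accepted pairs that do not."""
    kept = corrs[corrs.index.isin(accepted)]
    n_good = int((kept >= good_cutoff).sum())
    n_bad = int((kept < good_cutoff).sum())
    return u_good * n_good + u_bad * n_bad
```

Varying `u_good` and `u_bad` expresses the analyst's own trade-off between keeping many features and keeping only trustworthy ones, which is the role the expected utilities play in the evaluation described above.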

2012 ◽  
Vol 52 (No. 3) ◽  
pp. 138-146 ◽  
Author(s):  
J. Vaníček

The paper presents new ideas from the international SQuaRE (Software Quality Requirements and Evaluation) standardisation research project, which concerns the development of a special branch of international standards for software quality. Data can be considered an integral part of software. The current international standards and technical reports of the ISO/IEC 9126 and ISO/IEC 14598 series and the ISO/IEC 12119 standard cover the whole software product as an indivisible entity. However, data sets such as databases and data stores have a special character and need a different structure of quality characteristics. Therefore it was decided in the SQuaRE project to create a special international standard for data quality. The main ideas for this standard, together with a critical discussion of them, are presented in this paper. The main part of this contribution was presented at the conference Agricultural Perspectives XIV, organised by the Czech University of Agriculture in Prague, September 20 to 21, 2005.


2019 ◽  
Author(s):  
Bruno Savelli ◽  
Sylvain Picard ◽  
Christophe Roux ◽  
Christophe Dunand

The recent explosion of transcriptomic and proteomic data has resulted in vast numbers of datasets that are unconnected and sometimes too large to be easily analysed. Integration between datasets and analysis of extracted datasets are limiting factors that need to be solved in order to make full use of the data and to connect them. ExpressWeb is an online web tool that combines Taylor clustering of expression data sets, used to extract gene networks, with gene annotations in order to visualise the co-expression network. Data sets can come from personal or publicly available experiments. ExpressWeb allows users to easily compute clustering on expression data and provides friendly and useful visualisation tools such as heatmaps, graphs and networks, generating output images which can be used for scientific publications.
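As an illustration of the general co-expression idea (not ExpressWeb's implementation), the sketch below builds a co-expression network from an expression matrix by thresholding pairwise correlations; the threshold value and the use of Pearson correlation are assumptions.

```python
# Minimal sketch: build a co-expression network from an expression matrix
# (genes x conditions) by linking genes whose profiles correlate strongly.
# Illustration of the general approach only, not ExpressWeb's code.
import pandas as pd
import networkx as nx

def coexpression_network(expr: pd.DataFrame, threshold: float = 0.8) -> nx.Graph:
    """Nodes are genes; an edge links two genes whose expression profiles
    have absolute Pearson correlation at or above `threshold`."""
    corr = expr.T.corr()                 # gene-by-gene correlation matrix
    graph = nx.Graph()
    graph.add_nodes_from(expr.index)
    genes = list(corr.index)
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = corr.at[g1, g2]
            if abs(r) >= threshold:
                graph.add_edge(g1, g2, weight=float(r))
    return graph
```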


2021 ◽  
Author(s):  
Thomas Naake ◽  
Wolfgang Huber

Motivation: First-line data quality assessment and exploratory data analysis are integral parts of any data analysis workflow. In high-throughput quantitative omics experiments (e.g. transcriptomics, proteomics, metabolomics), after initial processing, the data are typically presented as a matrix of numbers (feature IDs x samples). Efficient and standardized calculation and visualization of data-quality metrics are key to tracking the within-experiment quality of these rectangular data types and to guaranteeing high-quality data sets and sound subsequent biological question-driven inference. Results: We present MatrixQCvis, which provides interactive visualization of data quality metrics at the per-sample and per-feature level using R's shiny framework. It provides efficient and standardized ways to analyze the data quality of quantitative omics data types that come in a matrix-like format (feature IDs x samples). MatrixQCvis builds upon the Bioconductor SummarizedExperiment S4 class and thus facilitates integration into existing workflows. Availability: MatrixQCvis is implemented in R. It is available via Bioconductor and released under the GPL v3.0 license.
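The kind of per-sample and per-feature metrics described here can be sketched in a few lines. The example below is illustrative Python rather than the package's R/Bioconductor implementation, and the specific metrics chosen are assumptions.

```python
# Sketch of simple quality metrics on a features x samples matrix,
# computed per sample (columns) and per feature (rows).
import pandas as pd

def sample_metrics(mat: pd.DataFrame) -> pd.DataFrame:
    """One row per sample: missingness, median intensity, total signal."""
    return pd.DataFrame({
        "n_missing": mat.isna().sum(axis=0),
        "frac_missing": mat.isna().mean(axis=0),
        "median": mat.median(axis=0, skipna=True),
        "sum": mat.sum(axis=0, skipna=True),
    })

def feature_metrics(mat: pd.DataFrame) -> pd.DataFrame:
    """One row per feature: missingness, mean, coefficient of variation."""
    mean = mat.mean(axis=1, skipna=True)
    std = mat.std(axis=1, skipna=True)
    return pd.DataFrame({
        "frac_missing": mat.isna().mean(axis=1),
        "mean": mean,
        "cv": std / mean,
    })
```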


IUCrJ ◽  
2019 ◽  
Vol 6 (5) ◽  
pp. 868-883 ◽  
Author(s):  
Magdalena Woinska ◽  
Monika Wanat ◽  
Przemyslaw Taciak ◽  
Tomasz Pawinski ◽  
Wladek Minor ◽  
...  

In this work, two methods of high-resolution X-ray data refinement, multipole refinement (MM) and Hirshfeld atom refinement (HAR), together with X-ray wavefunction refinement (XWR), are applied to investigate the refinement of positions and anisotropic thermal motion of hydrogen atoms, experiment-based reconstruction of electron density, refinement of anharmonic thermal vibrations, as well as the effects of excluding the weakest reflections from the refinement. The study is based on X-ray data sets of varying quality collected for crystals of four quinoline derivatives with Cl, Br and I atoms and the -S-Ph group as substituents. Energetic investigations are performed, comprising the calculation of intermolecular interaction energies, cohesive energies and geometrical relaxation energies. The results obtained for the experimentally derived structures are verified against values calculated for structures optimized using dispersion-corrected periodic density functional theory. For the high-quality data sets (the Cl and -S-Ph compounds), both MM and XWR could be used successfully to refine the atomic displacement parameters and the positions of hydrogen atoms; however, the bond lengths obtained with XWR were more precise and closer to the theoretical values. When applied to the more challenging data sets (the Br and I compounds), only XWR enabled free refinement of the hydrogen atom geometrical parameters; nevertheless, the results clearly showed poor data quality. For both refinement methods, the energy values (intermolecular interaction, cohesive and relaxation) calculated for the experimental structures were in similar agreement with the values associated with the optimized structures; the most significant divergences were observed when experimental geometries were biased by poor data quality. XWR was found to be more robust in avoiding incorrect distortions of the reconstructed electron density caused by data quality issues. The analysis of anharmonic thermal motion refinement shows that, for the most reliable interpretation of the results, it is necessary to use the complete data set, including the weak reflections.
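As a simple illustration of how refined geometries can be compared with DFT-optimised references, the sketch below computes a root-mean-square deviation over bond lengths; the bond labels and distances are hypothetical, not values from this study.

```python
# Illustrative sketch: quantify how closely refined hydrogen-atom bond
# lengths match a periodic DFT-optimised reference structure.
# Labels and distances below are hypothetical.
import numpy as np

def bond_length_agreement(refined: dict, optimised: dict) -> float:
    """Root-mean-square deviation (Angstrom) over bonds present in both sets."""
    common = sorted(set(refined) & set(optimised))
    diffs = np.array([refined[b] - optimised[b] for b in common])
    return float(np.sqrt(np.mean(diffs ** 2)))

# Hypothetical C-H distances (Angstrom): refinement vs. DFT reference.
xwr = {"C1-H1": 1.083, "C2-H2": 1.079, "C3-H3": 1.085}
dft = {"C1-H1": 1.086, "C2-H2": 1.083, "C3-H3": 1.084}
print(bond_length_agreement(xwr, dft))
```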


2017 ◽  
Vol 4 (1) ◽  
pp. 25-31 ◽  
Author(s):  
Diana Effendi

The Information Product approach (IP approach) is an information management approach that can be used to manage information as a product and to analyze data quality. An IP-Map can be used by organizations to facilitate the management of knowledge about how data are collected, stored, maintained, and used in an organized way. The data management process for academic activities at X University has not yet used the IP approach, and the university has not paid attention to the quality of its information. So far, X University has focused only on the system applications used to support the automation of data management in its academic activities. The IP-Map constructed in this paper can be used as a basis for analyzing data and information quality. Using the IP-Map, X University is expected to identify which parts of the process need improvement in the quality of data and information management. Index terms: IP approach, IP-Map, information quality, data quality.
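A minimal sketch of how an IP-Map might be represented programmatically, as a directed graph of typed blocks; the academic-data example and block names are hypothetical, not the map built for X University.

```python
# Sketch: an IP-Map as a directed graph whose nodes are typed blocks
# (source, processing, storage, information product, consumer).
# The example flow below is hypothetical.
import networkx as nx

ip_map = nx.DiGraph()
blocks = {
    "student_forms": "source",
    "validate_records": "processing",
    "academic_db": "storage",
    "transcript_report": "information_product",
    "registrar": "consumer",
}
for name, block_type in blocks.items():
    ip_map.add_node(name, block_type=block_type)

ip_map.add_edges_from([
    ("student_forms", "validate_records"),
    ("validate_records", "academic_db"),
    ("academic_db", "transcript_report"),
    ("transcript_report", "registrar"),
])

# Processing order along which quality defects would propagate downstream.
print(list(nx.topological_sort(ip_map)))
```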


Author(s):  
Parag A Pathade ◽  
Vinod A Bairagi ◽  
Yogesh S. Ahire ◽  
Neela M Bhatia

'Proteomics' is an emerging technology leading to high-throughput identification and understanding of proteins. Proteomics is the protein equivalent of genomics and has captured the imagination of biomolecular scientists worldwide. Because the proteome reveals more accurately the dynamic state of a cell, tissue, or organism, much is expected from proteomics to provide better disease markers for diagnosis and therapy monitoring. Proteomics is expected to play a major role in biomedical research, and it will have a significant impact on the development of diagnostics and therapeutics for cancer, heart ailments and infectious diseases in the future. Proteomics research leads to the identification of new protein markers for diagnostic purposes and novel molecular targets for drug discovery. Though the potential is great, many challenges and issues remain to be solved, such as gene expression, peptides, low-abundance proteins, analytical tools, drug target discovery and cost. A systematic and efficient analysis of vast genomic and proteomic data sets is a major challenge for researchers today. Nevertheless, proteomics is the groundwork for constructing and extracting useful knowledge for biomedical research. This review article covers some of the opportunities and challenges offered by proteomics.


Algorithms ◽  
2021 ◽  
Vol 14 (3) ◽  
pp. 76 ◽  
Author(s):  
Estrella Lucena-Sánchez ◽  
Guido Sciavicco ◽  
Ionel Eduard Stan

Air quality modelling that relates meteorological, car traffic, and pollution data is a fundamental problem, approached in several different ways in the recent literature. In particular, a set of such data sampled at a specific location during a specific period of time can be seen as a multivariate time series, and modelling the values of the pollutant concentrations can be seen as a multivariate temporal regression problem. In this paper, we propose a new method for symbolic multivariate temporal regression, and we apply it to several data sets that contain real air quality data from the city of Wrocław (Poland). Our experiments show that our approach is superior to classical, especially symbolic, ones, both in statistical performance and in the interpretability of the results.
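To make the regression setting concrete, the sketch below casts a toy multivariate series as a lagged supervised regression task and fits a plain linear baseline; the variables, lags, synthetic data and learner are illustrative assumptions and differ from the symbolic method proposed in the paper.

```python
# Sketch of multivariate temporal regression: lag meteorological and traffic
# series and predict a pollutant concentration. Synthetic, illustrative data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "temp": rng.normal(10, 5, n),
    "wind": rng.gamma(2.0, 2.0, n),
    "traffic": rng.poisson(300, n).astype(float),
})
# Toy pollutant series driven by traffic and damped by wind (illustrative only).
df["no2"] = 0.05 * df["traffic"] - 1.5 * df["wind"] + rng.normal(0, 2, n)

def make_lagged(frame: pd.DataFrame, target: str, lags: int = 3):
    """Turn the multivariate time series into a supervised regression task."""
    X = pd.DataFrame({f"{c}_lag{k}": frame[c].shift(k)
                      for c in frame.columns for k in range(1, lags + 1)})
    mask = X.notna().all(axis=1)
    return X[mask], frame.loc[mask, target]

X, y = make_lagged(df, "no2")
model = LinearRegression().fit(X, y)
print(round(model.score(X, y), 3))   # in-sample R^2 of the baseline model
```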


2021 ◽  
pp. 000276422110216
Author(s):  
Kazimierz M. Slomczynski ◽  
Irina Tomescu-Dubrow ◽  
Ilona Wysmulek

This article proposes a new approach to analyze protest participation measured in surveys of uneven quality. Because single international survey projects cover only a fraction of the world’s nations in specific periods, researchers increasingly turn to ex-post harmonization of different survey data sets not a priori designed as comparable. However, very few scholars systematically examine the impact of the survey data quality on substantive results. We argue that the variation in source data, especially deviations from standards of survey documentation, data processing, and computer files—proposed by methodologists of Total Survey Error, Survey Quality Monitoring, and Fitness for Intended Use—is important for analyzing protest behavior. In particular, we apply the Survey Data Recycling framework to investigate the extent to which indicators of attending demonstrations and signing petitions in 1,184 national survey projects are associated with measures of data quality, controlling for variability in the questionnaire items. We demonstrate that the null hypothesis of no impact of measures of survey quality on indicators of protest participation must be rejected. Measures of survey documentation, data processing, and computer records, taken together, explain over 5% of the intersurvey variance in the proportions of the populations attending demonstrations or signing petitions.
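The variance-explained claim can be pictured with a small sketch: regress a protest indicator on questionnaire and quality measures and compare R^2 with and without the quality terms. The data, column names and coefficients below are synthetic and purely illustrative, not the Survey Data Recycling database.

```python
# Sketch: R^2 gained by adding survey-quality measures to a model of
# protest participation, on synthetic survey-level data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_surveys = 500
surveys = pd.DataFrame({
    "item_wording": rng.integers(0, 3, n_surveys),       # questionnaire variability
    "doc_quality": rng.uniform(0, 1, n_surveys),          # documentation score
    "processing_errors": rng.poisson(2, n_surveys),       # data-processing issues
})
surveys["pct_demonstrating"] = (
    8 + 2 * surveys["item_wording"] + 3 * surveys["doc_quality"]
    - 0.5 * surveys["processing_errors"] + rng.normal(0, 4, n_surveys)
)

def r2(cols):
    """In-sample R^2 of a linear model using the given predictor columns."""
    X, y = surveys[cols], surveys["pct_demonstrating"]
    return LinearRegression().fit(X, y).score(X, y)

baseline = r2(["item_wording"])
with_quality = r2(["item_wording", "doc_quality", "processing_errors"])
print(f"intersurvey variance explained by quality measures: {with_quality - baseline:.1%}")
```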


2014 ◽  
Vol 668-669 ◽  
pp. 1374-1377 ◽  
Author(s):  
Wei Jun Wen

ETL refers to the process of data extraction, transformation and loading, and is deemed a critical step in ensuring the quality, specification and standardization of marine environmental data. Marine data, because of their complexity, field diversity and huge volume, remain decentralized, polyphyletic and heterogeneous, with differing semantics, and hence are far from being able to provide effective data sources for decision making. ETL enables the construction of a marine environmental data warehouse through the cleaning, transformation, integration, loading and periodic updating of basic marine data. The paper presents research on rules for cleaning, transforming and integrating marine data, based on which an ETL system for the marine environmental data warehouse is designed and developed. The system further guarantees data quality and correctness in future analysis and decision-making based on marine environmental data.
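A minimal sketch of the extract-transform-load flow under discussion, assuming a hypothetical marine observation file with toy cleaning rules; it illustrates the pattern, not the system described in the paper.

```python
# Sketch of an ETL step for marine observations: extract a CSV, apply
# cleaning/standardisation rules, load into a warehouse table.
# Field names, units and rules are illustrative assumptions.
import sqlite3
import pandas as pd

def extract(csv_path: str) -> pd.DataFrame:
    return pd.read_csv(csv_path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, enforce a plausible salinity range, and convert
    temperatures reported in Fahrenheit to Celsius."""
    df = raw.drop_duplicates().copy()
    df = df[df["salinity_psu"].between(0, 45)]          # physically plausible range
    fahr = df["temp_unit"] == "F"
    df.loc[fahr, "temperature"] = (df.loc[fahr, "temperature"] - 32) * 5 / 9
    df["temp_unit"] = "C"
    return df

def load(df: pd.DataFrame, db_path: str = "marine_dw.sqlite") -> None:
    with sqlite3.connect(db_path) as conn:
        df.to_sql("observations", conn, if_exists="append", index=False)

# Periodic update of the warehouse from a station file (hypothetical path):
# load(transform(extract("station_42.csv")))
```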

