How to Inspect and Measure Data Quality about Scientific Publications: Use Case of Wikipedia and CRIS Databases

Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 107 ◽  
Author(s):  
Otmane Azeroual ◽  
Włodzimierz Lewoniewski

The quality assurance of publication data in collaborative knowledge bases and in current research information systems (CRIS) is becoming increasingly relevant with the use of freely available spatial information in different application scenarios. When integrating such data into a CRIS, it is necessary to be able to recognize and assess their quality; only then is it possible to compile from the available data a result that fulfills its purpose for the user, namely to deliver reliable data and information. This paper discusses the quality problems of source metadata in Wikipedia and CRIS. Based on real data from over 40 million Wikipedia articles in various languages, we performed a preliminary quality analysis of the metadata of scientific publications using a data quality tool. So far, no data quality measurements have been programmed with Python to assess the quality of metadata from scientific publications in Wikipedia and CRIS. With this in mind, we programmed the methods and algorithms as code, presented in this paper as pseudocode, to measure quality along objective data quality dimensions such as completeness, correctness, consistency, and timeliness. The code was prepared as a macro service so that users can apply the measurement results to make a statement about the quality of their scientific publication metadata, allowing management to rely on high-quality data when making decisions.
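The paper presents its measurement routines only as pseudocode; the short Python sketch below shows how two of the named dimensions, completeness and timeliness, could be scored over publication metadata records. The field names, decay window, and scoring rules are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' code): scoring completeness and
# timeliness of scientific-publication metadata records.
from datetime import date

REQUIRED_FIELDS = ["title", "authors", "year", "journal", "doi"]  # assumed schema

def completeness(record: dict) -> float:
    """Share of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, "", []))
    return filled / len(REQUIRED_FIELDS)

def timeliness(record: dict, max_age_years: int = 10) -> float:
    """Linear decay from 1 (current year) to 0 (max_age_years old or older)."""
    year = record.get("year")
    if not isinstance(year, int):
        return 0.0
    age = max(0, date.today().year - year)
    return max(0.0, 1.0 - age / max_age_years)

if __name__ == "__main__":
    sample = {"title": "A study", "authors": ["Doe, J."], "year": 2019,
              "journal": "Algorithms", "doi": ""}
    print(f"completeness = {completeness(sample):.2f}")   # 0.80 (DOI missing)
    print(f"timeliness   = {timeliness(sample):.2f}")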

Sensors ◽  
2019 ◽  
Vol 19 (13) ◽  
pp. 2927
Author(s):  
Zihao Shao ◽  
Huiqiang Wang ◽  
Guangsheng Feng

Mobile crowdsensing (MCS) is a way to use social resources to solve high-precision environmental awareness problems in real time. Publishers hope to collect as much sensed data as possible at relatively low cost, while users want to earn more revenue at low cost. Low-quality data reduce the efficiency of MCS and lead to a loss of revenue. However, existing work lacks research on the selection of user revenue under the premise of ensured data quality. In this paper, we propose a Publisher-User Evolutionary Game Model (PUEGM) and a revenue selection method to solve the evolutionary stable equilibrium problem based on non-cooperative evolutionary game theory. Firstly, the choice of user revenue is modeled as a Publisher-User Evolutionary Game Model. Secondly, based on error-elimination decision theory, we incorporate a data quality assessment algorithm into the PUEGM, which aims to remove low-quality data and improve the overall quality of user data. Finally, the optimal user revenue strategy under different conditions is obtained from the evolutionarily stable strategy (ESS) solution and a stability analysis. To verify the efficiency of the proposed solutions, extensive experiments using real data sets were conducted. The experimental results demonstrate that our proposed method achieves high accuracy in data quality assessment and a reasonable selection of user revenue.
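The evolutionary-game analysis can be made concrete with a small numerical sketch. The payoff matrices below are hypothetical placeholders, not the payoffs derived in the paper; the code only illustrates two-population replicator dynamics converging toward an evolutionarily stable state.

# Toy two-population replicator dynamics for a publisher-user game.
# Payoff matrices are made up for illustration; they are not the PUEGM payoffs.
import numpy as np

# Publisher payoffs: rows = {reward high, reward low}; columns = user strategies.
A = np.array([[3.0, 0.5],
              [2.0, 1.0]])
# User payoffs: rows = {submit high-quality data, submit low-quality data};
# columns = publisher strategies.
B = np.array([[2.5, 0.5],
              [1.5, 1.0]])

x, y = 0.3, 0.4          # initial shares of "reward high" publishers / "high-quality" users
dt, steps = 0.01, 20000
for _ in range(steps):
    u = np.array([y, 1 - y])          # user strategy profile seen by publishers
    v = np.array([x, 1 - x])          # publisher strategy profile seen by users
    fx = A @ u                        # publisher payoff per strategy
    fy = B @ v                        # user payoff per strategy
    x += dt * x * (1 - x) * (fx[0] - fx[1])
    y += dt * y * (1 - y) * (fy[0] - fy[1])

print(f"stable shares: publishers rewarding high = {x:.3f}, "
      f"users submitting high quality = {y:.3f}")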


2017 ◽  
Vol 4 (1) ◽  
pp. 25-31 ◽  
Author(s):  
Diana Effendi

The Information Product Approach (IP Approach) is an information management approach that can be used to manage information products and analyze data quality. An IP-Map can be used by organizations to facilitate the management of knowledge in collecting, storing, maintaining, and using data in an organized manner. The data management process for academic activities at X University has not yet used the IP approach; X University has not paid attention to managing the quality of its information and has so far focused only on the system applications used to automate data management in its academic activities. The IP-Map constructed in this paper can be used as a basis for analyzing the quality of data and information. With the IP-Map, X University is expected to identify which parts of the process need improvement in data and information quality management.

Index terms: IP Approach, IP-Map, information quality, data quality.


2018 ◽  
pp. 57-62
Author(s):  
E. I. Gundrova ◽  
A. P. Lukyanov ◽  
A. V. Pruglo ◽  
S. S. Ravdin

Previously, the authors proposed a generalized model for estimating the parameters of the luminosity distribution law of space objects, in which not only successful but also unsuccessful measurement results are taken into account. Estimation was carried out on observations made under similar conditions: phase angle, range, and telescope sensitivity. Under these limitations the algorithm was tested on model data and real measurements; the results showed that it was not suitable for cases in which the range to the space object changes. In this work, a new algorithm is proposed that allows information obtained at different ranges to the observed space object to be merged. Luminosity values are reduced to a reference distance of 1000 km, taking the telescope's sensitivity into account. To obtain parameter estimates, the Cramér-von Mises-Smirnov criterion is used. The algorithm was tested on model data, and results of its operation on real data were obtained. These data showed that the algorithm works correctly and confirmed the practicality of registering unsuccessful measurements.
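As an illustration of the range-reduction step, the sketch below scales observed brightness values to the 1000 km reference distance with the inverse-square law and checks the reduced sample against a fitted normal law using the Cramér-von Mises statistic. The synthetic data, the assumed distribution family, and the omission of the telescope-sensitivity correction are simplifications; this is not the authors' algorithm.

# Reduce observed magnitudes to a 1000 km reference range (inverse-square law)
# and test the reduced sample against a fitted normal law with the
# Cramér-von Mises statistic. Synthetic data; not the paper's algorithm.
import numpy as np
from scipy import stats

D_REF_KM = 1000.0

def reduce_to_reference(magnitudes, ranges_km):
    """m_ref = m_obs - 5*log10(d / d_ref), since flux scales as 1/d^2."""
    return magnitudes - 5.0 * np.log10(np.asarray(ranges_km) / D_REF_KM)

rng = np.random.default_rng(0)
ranges = rng.uniform(800.0, 2500.0, size=200)                  # km
true_m_ref = rng.normal(loc=7.0, scale=0.3, size=200)          # magnitude at 1000 km
observed = true_m_ref + 5.0 * np.log10(ranges / D_REF_KM)      # fainter when farther

reduced = reduce_to_reference(observed, ranges)
mu, sigma = reduced.mean(), reduced.std(ddof=1)
# Parameters are estimated from the same sample, so the p-value is only indicative.
result = stats.cramervonmises(reduced, "norm", args=(mu, sigma))
print(f"CvM statistic = {result.statistic:.4f}, p = {result.pvalue:.3f}")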


2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
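A small sketch of how such measures could be approximated is given below: the data are first compressed with a Gaussian random projection, class separability is scored as the ratio of between-class to within-class scatter, and in-class variability is estimated by bootstrapping the per-class spread. The exact definitions in the paper differ; the sketch only illustrates the random-projection-plus-bootstrap pattern.

# Illustrative sketch (not the paper's exact measures): class separability and
# in-class variability after a Gaussian random projection, with bootstrapping.
import numpy as np

def random_project(X, k, rng):
    """Project d-dimensional rows of X to k dimensions with a Gaussian matrix."""
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

def separability(X, y):
    """Ratio of between-class to within-class scatter (higher = more separable)."""
    classes = np.unique(y)
    centers = np.array([X[y == c].mean(axis=0) for c in classes])
    within = np.mean([np.mean(np.linalg.norm(X[y == c] - m, axis=1))
                      for c, m in zip(classes, centers)])
    between = np.mean(np.linalg.norm(centers - centers.mean(axis=0), axis=1))
    return between / within

def in_class_variability(X, y, n_boot=100, rng=None):
    """Bootstrap estimate of the average per-class spread."""
    rng = rng or np.random.default_rng()
    spreads = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        Xb, yb = X[idx], y[idx]
        spreads.append(np.mean([Xb[yb == c].std() for c in np.unique(yb)]))
    return float(np.mean(spreads))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 512)), rng.normal(0.5, 1, (200, 512))])
y = np.array([0] * 200 + [1] * 200)
Xp = random_project(X, 32, rng)
print("separability:", round(separability(Xp, y), 3))
print("in-class variability:", round(in_class_variability(Xp, y, rng=rng), 3))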


2014 ◽  
Vol 668-669 ◽  
pp. 1374-1377 ◽  
Author(s):  
Wei Jun Wen

ETL refers to the process of data extraction, transformation and loading and is deemed a critical step in ensuring the quality, specification and standardization of marine environmental data. Marine data, due to their complexity, field diversity and huge volume, remain decentralized and heterogeneous in source and structure, with differing semantics, and hence are far from being able to provide effective data sources for decision making. ETL enables the construction of a marine environmental data warehouse through the cleaning, transformation, integration, loading and periodic updating of basic marine data. The paper presents research on rules for the cleaning, transformation and integration of marine data, based on which an original ETL system for the marine environmental data warehouse is designed and developed. The system further guarantees data quality and the correctness of future analysis and decision-making based on marine environmental data.
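A minimal sketch of the extract-clean-transform-load flow described above is given below using pandas and SQLite; the field names, cleaning rules, and target schema are assumed placeholders, not the marine data warehouse design from the paper.

# Minimal ETL sketch (assumed schema, not the paper's system): extract a CSV of
# marine observations, clean and standardize it, and load it into a warehouse table.
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    return pd.read_csv(path, parse_dates=["observed_at"])

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset=["station_id", "observed_at"])
    df = df.dropna(subset=["station_id", "observed_at", "sea_temp_c"])
    # Simple range checks as cleaning rules (illustrative thresholds).
    return df[df["sea_temp_c"].between(-2, 40) & df["salinity_psu"].between(0, 45)]

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["station_id"] = df["station_id"].str.strip().str.upper()   # standardize codes
    df["obs_date"] = df["observed_at"].dt.date.astype(str)        # conform date format
    return df[["station_id", "obs_date", "sea_temp_c", "salinity_psu"]]

def load(df: pd.DataFrame, db_path: str = "marine_dw.sqlite") -> None:
    with sqlite3.connect(db_path) as conn:
        df.to_sql("fact_observation", conn, if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(clean(extract("marine_observations.csv"))))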


2013 ◽  
Vol 427-429 ◽  
pp. 2441-2444
Author(s):  
Wei Chen ◽  
Long Chen ◽  
Ming Li

This paper presents a software design for power quality analysis and data management. The software was programmed in LabVIEW and Oracle, running on Windows on a regular PC. LabVIEW acquires data continuously from the lower machine via TCP/IP. Using its database connectivity toolkit, LabVIEW accesses Oracle to store and retrieve the power quality data according to different indicators. A friendly GUI was built for data display and user operation, taking advantage of the powerful data-handling capacity of LabVIEW and its rich controls. Moreover, Excel reports can be exported using the report generation toolkit in LabVIEW. The software greatly improves data analysis and management capacity.
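The acquisition-and-storage pattern described (continuous readings from the lower machine over TCP/IP, written to a database for later retrieval by indicator) can be sketched in a few lines. The original system is built in LabVIEW with Oracle, so the Python/SQLite version below is only an analogue, and the line-delimited "indicator,value" message format is an assumption.

# Python/SQLite analogue of the acquire-over-TCP-and-store pattern described;
# the actual system uses LabVIEW and Oracle. Message format is an assumption:
# one "indicator,value" line per reading, newline-terminated.
import socket
import sqlite3

HOST, PORT = "192.0.2.10", 5020          # placeholder address of the lower machine

def run(db_path: str = "power_quality.sqlite") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pq_data "
                 "(ts DATETIME DEFAULT CURRENT_TIMESTAMP, indicator TEXT, value REAL)")
    with socket.create_connection((HOST, PORT)) as sock, sock.makefile("r") as stream:
        for line in stream:                      # one reading per line
            indicator, value = line.strip().split(",")
            conn.execute("INSERT INTO pq_data (indicator, value) VALUES (?, ?)",
                         (indicator, float(value)))
            conn.commit()

if __name__ == "__main__":
    run()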


2021 ◽  
Author(s):  
Qing Xie ◽  
Chengong Han ◽  
Victor Jin ◽  
Shili Lin

Single-cell Hi-C techniques enable one to study cell-to-cell variability in chromatin interactions. However, single-cell Hi-C (scHi-C) data suffer severely from sparsity, that is, the existence of excess zeros due to insufficient sequencing depth. Complicating things further is the fact that not all zeros are created equal: some are due to loci truly not interacting because of the underlying biological mechanism (structural zeros), whereas others are indeed due to insufficient sequencing depth (sampling zeros), especially for loci that interact infrequently. Differentiating between structural zeros and sampling zeros is important since correct inference would improve downstream analyses such as clustering and discovery of subtypes. Nevertheless, distinguishing between these two types of zeros has received little attention in the single-cell Hi-C literature, where the issue of sparsity has been addressed mainly as a data quality improvement problem. To fill this gap, in this paper we propose HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros. HiCImpute takes the spatial dependencies of the scHi-C 2D data structure into account while also borrowing information from similar single cells and bulk data, when such are available. Through an extensive set of analyses of synthetic and real data, we demonstrate the ability of HiCImpute to identify structural zeros with high sensitivity and to accurately impute dropout values in sampling zeros. Downstream analyses using data improved by HiCImpute yielded much more accurate clustering of cell types compared to using observed data or data improved by several comparison methods. Most significantly, HiCImpute-improved data have led to the identification of subtypes within each of the excitatory neuronal cells of L4 and L5 in the prefrontal cortex.
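The structural-versus-sampling distinction can be made concrete with a toy zero-inflated Poisson calculation. This is not HiCImpute's Bayesian hierarchical model, but it shows why the two kinds of zeros must be treated differently: given a prior structural-zero proportion and an interaction rate, the posterior probability that an observed zero is structural follows directly from Bayes' rule.

# Toy illustration (not the HiCImpute model): under a zero-inflated Poisson view
# of one scHi-C contact entry, compute the probability that an observed zero
# count is a structural zero rather than a sampling zero.
import math

def prob_structural_given_zero(pi: float, lam: float) -> float:
    """
    pi  : prior probability the locus pair truly does not interact (structural zero)
    lam : expected contact count if the pair does interact
    P(structural | count = 0) = pi / (pi + (1 - pi) * exp(-lam))
    """
    return pi / (pi + (1.0 - pi) * math.exp(-lam))

# Low sequencing depth (small lam): zeros are ambiguous, posterior stays near the prior.
print(round(prob_structural_given_zero(pi=0.3, lam=0.5), 3))   # ~0.414
# Deep sequencing (large lam): an observed zero is almost certainly structural.
print(round(prob_structural_given_zero(pi=0.3, lam=5.0), 3))   # ~0.985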


2021 ◽  
Author(s):  
Victoria Leong ◽  
Kausar Raheel ◽  
Sim Jia Yi ◽  
Kriti Kacker ◽  
Vasilis M. Karlaftis ◽  
...  

Background. The global COVID-19 pandemic has triggered a fundamental reexamination of how human psychological research can be conducted both safely and robustly in a new era of digital working and physical distancing. Online web-based testing has risen to the fore as a promising solution for rapid mass collection of cognitive data without requiring human contact. However, a long-standing debate exists over the data quality and validity of web-based studies. Here, we examine the opportunities and challenges afforded by the societal shift toward web-based testing, highlight an urgent need to establish a standard data quality assurance framework for online studies, and develop and validate a new supervised online testing methodology, remote guided testing (RGT). Methods. A total of 85 healthy young adults were tested on 10 cognitive tasks assessing executive functioning (flexibility, memory and inhibition) and learning. Tasks were administered either face-to-face in the laboratory (N=41) or online using remote guided testing (N=44), delivered using identical web-based platforms (CANTAB, Inquisit and i-ABC). Data quality was assessed using detailed trial-level measures (missed trials, outlying and excluded responses, response times) as well as overall task performance measures. Results. The results indicated that, across all measures of data quality and performance, RGT data were statistically equivalent to data collected in person in the lab. Moreover, RGT participants outperformed the lab group on measured verbal intelligence, which could reflect test environment differences, including possible effects of mask-wearing on communication. Conclusions. These data suggest that the RGT methodology could help to ameliorate concerns regarding online data quality and, particularly for studies involving high-risk or rare cohorts, offer an alternative for collecting high-quality human cognitive data without requiring in-person physical attendance.
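The equivalence claim can be illustrated with a two one-sided tests (TOST) check on a single data quality measure; the equivalence margin, the synthetic data, and the choice of TOST itself are assumptions for illustration and are not necessarily the analysis used in the study.

# Illustrative TOST equivalence check on one data-quality measure (e.g., per-participant
# missed-trial rate) between a lab group and an RGT group. Synthetic data; the margin
# and the use of TOST are assumptions, not necessarily the study's analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lab = rng.normal(0.04, 0.02, size=41)    # missed-trial rate, lab-tested participants
rgt = rng.normal(0.045, 0.02, size=44)   # missed-trial rate, remote guided testing
margin = 0.02                            # equivalence margin (assumed)

# TOST: reject both one-sided nulls to conclude |mean(lab) - mean(rgt)| < margin.
p_lower = stats.ttest_ind(lab + margin, rgt, alternative="greater").pvalue
p_upper = stats.ttest_ind(lab - margin, rgt, alternative="less").pvalue
p_tost = max(p_lower, p_upper)
print(f"TOST p-value = {p_tost:.4f}  (equivalent at alpha=0.05: {p_tost < 0.05})")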

