Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study

10.2196/10013 ◽  
2019 ◽  
Vol 21 (1) ◽  
pp. e10013 ◽  
Author(s):  
Hyunki Woo ◽  
Kyunga Kim ◽  
KyeongMin Cha ◽  
Jin-Young Lee ◽  
Hansong Mun ◽  
...  

BACKGROUND As medical research based on big data has become more common, the community’s interest in and efforts to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, such large-scale text data are often not readily usable for analysis owing to typographical errors, inconsistencies, and data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE In this paper, we propose an efficient data cleaning process for large-scale medical text data that employs text clustering methods and a value-converting technique, and we evaluate its performance on medical examination text data. METHODS The proposed data cleaning process consists of text clustering and value-converting steps. In the text clustering step, we suggest using key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster are expected to comprise a correct value and its erroneous representations. In the value-converting step, the wrong values in each identified cluster are converted into the correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open-source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestions. RESULTS A total of 1,167,104 words in the stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the total number of words. CONCLUSIONS Our data cleaning process, based on the combined use of key collision and nearest neighbor methods, provides efficient cleaning of large-scale text data and hence improves data accuracy.
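
The following is a minimal sketch, not the authors' code, of the two clustering steps the abstract combines (key collision via fingerprinting and nearest neighbor via string similarity) followed by value-converting to the most frequent form in each cluster. The sample similarity threshold and the merging rule are illustrative assumptions; OpenRefine implements these ideas interactively.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def fingerprint(value: str) -> str:
    """Key-collision key: normalize case/whitespace/punctuation and sort unique tokens."""
    v = unicodedata.normalize("NFKD", value).lower().strip()
    v = re.sub(r"[^\w\s]", "", v)
    return " ".join(sorted(set(v.split())))


def key_collision_clusters(values):
    """Group values that share the same fingerprint key."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    return [group for group in clusters.values() if len(group) > 1]


def nearest_neighbor_clusters(values, threshold=0.85):
    """Pairwise similarity clustering to catch typos that key collision misses."""
    values, clusters, used = list(values), [], set()
    for i, a in enumerate(values):
        if i in used:
            continue
        group = [a]
        for j in range(i + 1, len(values)):
            if j not in used and SequenceMatcher(None, a.lower(), values[j].lower()).ratio() >= threshold:
                group.append(values[j])
                used.add(j)
        if len(group) > 1:
            clusters.append(group)
    return clusters


def convert_values(records, clusters, counts):
    """Value-converting: replace every member of a cluster with its most frequent form."""
    mapping = {}
    for group in clusters:
        correct = max(group, key=lambda v: counts.get(v, 0))
        for v in group:
            mapping[v] = correct
    return [mapping.get(r, r) for r in records]
```

In practice the two clustering passes would be run over the distinct report words, the suggested merges reviewed by a domain expert, and the accepted mappings then applied to all records.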


2021 ◽  
Vol 13 (7) ◽  
pp. 1367
Author(s):  
Yuanzhi Cai ◽  
Hong Huang ◽  
Kaiyang Wang ◽  
Cheng Zhang ◽  
Lei Fan ◽  
...  

Over the last decade, 3D reconstruction techniques have been developed to present up-to-date as-is information for various objects and to build city information models. Meanwhile, deep learning-based approaches are employed to add semantic information to these models. Studies have shown that model accuracy can be improved by combining multiple data channels (e.g., XYZ, Intensity, D, and RGB). Nevertheless, redundant data channels in large-scale datasets may incur high computational cost and processing time. Few researchers have addressed the question of which combination of channels is optimal in terms of overall accuracy (OA) and mean intersection over union (mIoU). Therefore, a framework is proposed to explore an efficient data fusion approach for semantic segmentation by selecting an optimal combination of data channels. Within the framework, a total of 13 channel combinations are investigated for data pre-processing, and the encoder-to-decoder structure is utilized for network permutations. A case study is carried out to investigate the efficiency of the proposed approach by adopting a city-level benchmark dataset and applying nine networks. It is found that the IRGB channel combination provides the best OA performance, while the IRGBD combination provides the best mIoU performance.
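
A minimal sketch of the evaluation side of such a framework, under generic assumptions rather than the paper's specific networks or dataset: it enumerates a few candidate channel combinations and scores predictions by OA and mIoU computed from a confusion matrix. The channel names and the enumeration rule are illustrative.

```python
from itertools import combinations

import numpy as np


def channel_combinations(base=("R", "G", "B"), extras=("I", "D")):
    """Yield candidate channel sets, e.g. RGB, IRGB, RGBD, IRGBD."""
    for k in range(len(extras) + 1):
        for extra in combinations(extras, k):
            yield tuple(extra) + base


def confusion_matrix(y_true, y_pred, num_classes):
    """Accumulate a confusion matrix from flattened integer label arrays."""
    idx = num_classes * y_true.astype(int) + y_pred.astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def overall_accuracy(conf):
    """OA = correctly classified pixels / all pixels."""
    return np.trace(conf) / conf.sum()


def mean_iou(conf):
    """mIoU = mean over classes of TP / (TP + FP + FN)."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean()
```

Each channel combination would then be used to train and evaluate the segmentation networks, and the combination with the best OA or mIoU would be selected for the final data fusion pipeline.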


2019 ◽  
Vol 6 (3) ◽  
pp. 4176-4187 ◽  
Author(s):  
Guorui Li ◽  
Jingsha He ◽  
Sancheng Peng ◽  
Weijia Jia ◽  
Cong Wang ◽  
...  

Author(s):  
Wei Zhang ◽  
Jie Wu ◽  
Yaping Lin

Cloud computing has attracted considerable interest from both academia and industry, since it provides efficient resource management, low cost, and fast deployment. However, concerns about security and privacy have become the main obstacle to the large-scale adoption of cloud computing. Encryption is one way to relieve these concerns; however, data encryption makes efficient data utilization a challenging problem. To address this problem, secure and privacy-preserving keyword search over large-scale cloud data has been proposed and widely developed. In this paper, we present a thorough survey of secure and privacy-preserving keyword search over large-scale cloud data. We review existing research works category by category, where the categories are defined by search functionality. For each category, we first elaborate on the key ideas of existing works and then summarize open and interesting problems.
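
A minimal sketch, not taken from the survey itself, of the basic idea behind symmetric searchable keyword search: the data owner builds an index that maps a keyed hash (trapdoor) of each keyword to matching document IDs, so the cloud can answer queries without seeing plaintext keywords. Key management, document encryption, access-pattern hiding, and result ranking are omitted; all names and data are illustrative.

```python
import hashlib
import hmac
from collections import defaultdict


def trapdoor(key: bytes, keyword: str) -> str:
    """Deterministic keyed hash of a keyword; only the key holder can produce it."""
    return hmac.new(key, keyword.lower().encode(), hashlib.sha256).hexdigest()


def build_index(key: bytes, documents: dict) -> dict:
    """Data owner: map trapdoor(keyword) -> IDs of documents containing that keyword."""
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for kw in keywords:
            index[trapdoor(key, kw)].add(doc_id)
    return dict(index)


def search(index: dict, token: str) -> set:
    """Cloud server: look up the query token without learning the underlying keyword."""
    return index.get(token, set())


# Usage: the owner outsources the index, then queries with a trapdoor.
key = b"owner-secret-key"
docs = {"doc1": ["cloud", "privacy"], "doc2": ["privacy", "search"]}
index = build_index(key, docs)
print(search(index, trapdoor(key, "privacy")))  # {'doc1', 'doc2'}
```

Richer search functionalities surveyed in this line of work (ranked, multi-keyword, fuzzy, or verifiable search) extend this basic index with additional structures and cryptographic techniques.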

