Combining Insights of Medical Readability Tools and Machine Learning for Reader-oriented Health Resource Evaluation (Preprint)
BACKGROUND Medical texts on websites are a rich resource through which the general public can access health information and seek advice on health concerns. However, comprehending this type of information demands more than general reading skill, because it often requires substantial health knowledge or health literacy in the specific disease area. Furthermore, an individual's reading ability is also influenced by other factors such as literacy, age, morbidities, socioeconomic status, interest in a specific health topic, and cultural and linguistic background. The literature suggests that traditional readability formulas were designed to give a single score for all readers. This calls for more adaptive readability assessment tools that evaluate online medical information more comprehensively for people with varied backgrounds. OBJECTIVE The aim of this study was to clarify the existing controversy around the inconsistency among readability formulas and to build a reader-oriented readability assessment tool that automatically estimates the readability of online health information while taking readers' diverse backgrounds into account. RESULTS We found that machine learning readability models integrating multiple readability formulas estimated the readability of online infectious disease information more effectively than any individual readability formula alone.
The integrated machine learning models incorporated features from the readability formulas while also considering readers' specific backgrounds, which resulted in superior performance in readability classification. CONCLUSIONS Combining existing readability formulas with machine learning techniques resulted in more accurate prediction of reading difficulty, extending beyond the linguistic features derived from the readability formulas. The proposed tool provides a reader-oriented assessment that serves as a more effective proxy for the readability of health information. The key significance of the study lies in its reader-centeredness, which incorporates readers' diverse backgrounds, and in its clarification of the relative effectiveness and compatibility of different medical readability tools via machine learning.
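To illustrate how scores from several readability formulas might be combined with reader background into one input for a classifier, the following is a minimal sketch, not the study's implementation. The abstract does not say which formulas, reader attributes, or models were used. The three formulas here (Flesch Reading Ease, Flesch-Kincaid Grade Level, Automated Readability Index) and the reader fields `age`, `years_of_education`, and `native_speaker` are assumptions chosen for illustration, and the syllable counter is a crude vowel-group heuristic.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count runs of vowels, with at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_features(text: str) -> dict:
    """Compute three standard readability scores from surface counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    chars_per_word = sum(len(w) for w in words) / len(words)

    return {
        "flesch_reading_ease": (
            206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
        ),
        "flesch_kincaid_grade": (
            0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
        ),
        "automated_readability_index": (
            4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43
        ),
    }


def build_feature_vector(text: str, reader: dict) -> list:
    """Concatenate formula scores with hypothetical reader attributes."""
    scores = readability_features(text)
    return [
        scores["flesch_reading_ease"],
        scores["flesch_kincaid_grade"],
        scores["automated_readability_index"],
        float(reader["age"]),                 # assumed attribute
        float(reader["years_of_education"]),  # assumed attribute
        1.0 if reader["native_speaker"] else 0.0,  # assumed attribute
    ]


if __name__ == "__main__":
    sample = "Wash your hands often. Stay at home when you feel sick."
    reader = {"age": 67, "years_of_education": 10, "native_speaker": False}
    print(build_feature_vector(sample, reader))
```

A vector like this, paired with readability labels collected from readers, could be passed to any standard classifier. Because the individual formula scores appear as separate features, the model can weight them differently for different reader profiles rather than relying on a single score for all readers.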