Crawling the German Health Web: Exploratory Study and Graph Analysis

10.2196/17853
2020, Vol 22 (7), pp. e17853
Author(s): Richard Zowalla, Thomas Wetter, Daniel Pfeifer

Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of the evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information available in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3).

Objective: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler.

Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl ran for 227 days. The crawler was evaluated using its harvest rate, and its recall was estimated using a seed-target approach.

Results: In total, n=22,405 seed URLs with the country-code top-level domains .de (85.36%, 19,126/22,405), .at (6.83%, 1530/22,405), and .ch (7.81%, 1749/22,405) were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were websites published by public institutions, 25% (19/75) by nonprofit organizations, and 35% (26/75) by private organizations or individuals.

Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends and to build health-specific search engines.
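For illustration, the following minimal sketch (in Python with scikit-learn, not the authors' StormCrawler-based Java implementation) shows the core decision of such a focused crawler: a linear SVM relevance filter over fetched page texts and the resulting harvest rate, i.e., the share of fetched pages classified as health-related. The training texts, labels, and page structure are hypothetical placeholders.

```python
# Illustrative sketch only: the study extended StormCrawler (Java); this shows
# the same idea (SVM relevance filter + harvest rate) with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Placeholder training data: page texts labeled 1 (health-related) or 0 (other).
train_texts = ["symptome und behandlung von diabetes",
               "gebrauchtwagen kaufen und verkaufen"]
train_labels = [1, 0]

relevance_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
relevance_clf.fit(train_texts, train_labels)

def crawl_step(fetched_pages):
    """Classify fetched pages (dicts with a 'text' field); in a focused crawl,
    only outlinks of relevant pages would be enqueued for further fetching."""
    labels = relevance_clf.predict([p["text"] for p in fetched_pages])
    relevant = [p for p, y in zip(fetched_pages, labels) if y == 1]
    harvest_rate = len(relevant) / len(fetched_pages)  # share of relevant fetches
    return relevant, harvest_rate
```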


2020, Vol 36 (3), pp. 807-821
Author(s): Heidi Kühnemann, Arnout van Delden, Dick Windmeijer

Classification of enterprises by main economic activity according to NACE codes is a challenging but important task for national statistical institutes. Since manual editing is time-consuming, we investigated automatic prediction from dedicated website texts using a knowledge-based approach. To that end, concept features were derived from a set of domain-specific keywords. Furthermore, we compared flat classification with a specific two-level hierarchy based on an approach used by manual editors. We limited ourselves to Naïve Bayes and Support Vector Machine models and only used texts from the main web pages. As a first step, we trained a filter model that classifies whether websites contain information about economic activity. The resulting filtered data set was subsequently used to predict 111 NACE classes. We found that using concept features did not improve model performance compared with a model using character n-grams, i.e., features that encode no domain knowledge. Neither did the two-level hierarchy improve performance relative to flat classification. Nonetheless, predicting the best three NACE classes clearly improved overall prediction performance compared with a top-one prediction. We conclude that more effort is needed to achieve good results with a knowledge-based approach, and we discuss ideas for improvement.
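The following sketch illustrates the top-one versus top-three evaluation described above with an assumed scikit-learn pipeline and hypothetical website texts and NACE codes; a prediction counts as a top-three hit if the true class is among the three highest-scoring classes.

```python
# Sketch of top-1 vs top-3 evaluation for a multi-class NACE classifier
# (hypothetical website texts and NACE codes).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["wir verkaufen brot und kuchen", "softwareentwicklung und it-beratung",
         "transport und logistik dienstleistungen", "friseursalon und kosmetik"]
nace = ["1071", "6201", "4941", "9602"]  # illustrative NACE codes

clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
                    MultinomialNB())
clf.fit(texts, nace)

proba = clf.predict_proba(texts)                         # class probabilities per site
top3 = clf.classes_[np.argsort(proba, axis=1)[:, -3:]]   # three highest-scoring classes
top1_acc = np.mean(clf.predict(texts) == np.array(nace))
top3_acc = np.mean([y in row for y, row in zip(nace, top3)])
print(top1_acc, top3_acc)
```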


2019, Vol 44 (1), pp. 213-237
Author(s): Ziqi Zhang, Georgica Bors

Purpose: This work studies automated user classification on Twitter in the public health domain, a task that is essential to many public health-related research efforts on social media but has not yet been addressed. The purpose of this paper is to obtain empirical knowledge on how to optimise classifier performance for this task.

Design/methodology/approach: A sample of 3,100 Twitter users who tweeted about different health conditions was manually coded into the six most common stakeholder types. The authors propose new, simple features extracted from the short Twitter profiles of these users, and compare them against a large set of classification models (including state-of-the-art approaches) that use more complex features and different algorithms on this data set.

Findings: The authors show that user classification in the public health domain is a very challenging task, as the best result obtainable on this data set is only 59 per cent in terms of F1 score. Compared with the state of the art, the proposed methods obtain significantly better results (10 percentage points in F1 on a "best-against-best" basis) when using only a small set of 40 features extracted from the short Twitter user profile texts.

Originality/value: The work is the first to study the different types of users that engage in health-related communication on social media, and it is applicable to a broad range of health conditions rather than the specific ones studied in previous work. The methods are implemented as open-source tools and, together with the data, are the first of this kind. The authors believe these will encourage future research to further improve this important task.
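As an illustration of the kind of simple profile-based features described above, the sketch below derives a few hand-crafted features from short Twitter profile texts and feeds them to a conventional classifier scored with macro F1. The keyword list, profiles, labels, and features are hypothetical; the paper's actual 40 features are not reproduced here.

```python
# Illustrative sketch: hand-crafted features from short Twitter profile texts
# feeding a conventional classifier; the paper's actual 40 features differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

KEYWORDS = {"patient", "md", "nurse", "research", "charity", "news"}  # hypothetical cues

def profile_features(profile: str) -> list:
    tokens = [t.strip(".,#@") for t in profile.lower().split()]
    return [len(tokens),                                  # profile length
            sum(tok in KEYWORDS for tok in tokens),       # keyword hits
            int("http" in profile.lower()),               # contains a link
            int("#" in profile)]                          # contains a hashtag

profiles = ["Cardiologist, MD, tweets about heart health",
            "Mum of two living with diabetes",
            "Official news account of a health research charity",
            "Love football and music"]
labels = ["professional", "patient", "organisation", "other"]

X = np.array([profile_features(p) for p in profiles])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(f1_score(labels, clf.predict(X), average="macro"))
```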


2021
Author(s): Frederic Henn, Richard Zowalla, Andreas Mayer

The internet has become an important resource for health information and for interactions with healthcare providers. However, information of all types can pass through many servers and networks before reaching its intended destination, and any of these intermediaries can intercept or even manipulate the exchanged information if the data transfer is not adequately protected. As trust is a fundamental concept in healthcare relationships, it is crucial to offer a secure medical website in order to maintain the same level of trust as in a face-to-face meeting. This study provides a first analysis of the SSL/TLS security of, and the HTTP security headers used within, the health-related web limited to web pages in German, the German Health Web (GHW).

Methods: testssl.sh and TLS-Scanner were used to analyze the URLs of the 1,000 top-ranked health-related websites (according to PageRank) for each of the country-code top-level domains ".de", ".at" and ".ch".

Results: Our study revealed that most websites in the GHW are potentially exposed to common SSL/TLS vulnerabilities, still offer deprecated SSL/TLS protocol versions, and mostly do not implement HTTP security headers at all.

Conclusions: These findings call into question the concept of trust within the GHW. Website owners should reconsider keeping outdated SSL/TLS protocol versions enabled for compatibility reasons. Additionally, HTTP security headers should be implemented more consistently to provide additional layers of security. In future work, the authors intend to repeat this study and to incorporate a website's category, i.e., governmental or public health, to get a more detailed view of the GHW's security.
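The following minimal Python sketch (standard library only; the study itself used testssl.sh and TLS-Scanner, not this script) illustrates the two aspects measured: the TLS protocol version negotiated with a host and the presence of common HTTP security headers. The host name is a placeholder.

```python
# Minimal illustration of the two measured aspects: negotiated TLS protocol
# version and presence of common HTTP security headers. Not the study's tooling.
import socket
import ssl
from urllib.request import urlopen

SECURITY_HEADERS = ["Strict-Transport-Security", "Content-Security-Policy",
                    "X-Content-Type-Options", "X-Frame-Options", "Referrer-Policy"]

def check_site(host: str) -> None:
    # Negotiate a TLS session and report the protocol version, e.g. 'TLSv1.3'.
    context = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            print(host, "negotiated", tls.version())

    # Fetch the start page and report which security headers are set.
    with urlopen(f"https://{host}/", timeout=10) as response:
        for header in SECURITY_HEADERS:
            present = response.headers.get(header) is not None
            print(f"  {header}: {'present' if present else 'missing'}")

check_site("example.org")  # placeholder host
```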


2020, Vol 27 (4), pp. 329-336
Author(s): Lei Xu, Guangmin Liang, Baowen Chen, Xu Tan, Huaikun Xiang, ...

Background: Cell lytic enzymes are highly evolved proteins that can destroy the cell structure and kill bacteria. Unlike antibiotics, cell lytic enzymes do not cause serious drug-resistance problems in pathogenic bacteria, which makes them a promising option for treating bacterial infections; the study of cell wall lytic enzymes therefore aims at finding an efficient way to cure such infections. Cell lytic enzymes include endolysins and autolysins, which differ in the purpose for which they break the cell wall. Identifying the type of a cell lytic enzyme is thus meaningful for the study of cell wall enzymes.

Objective: Our motivation in this article is to predict the type of a cell lytic enzyme. Because detecting the type by experimental methods is time-consuming, we propose an efficient computational method for this prediction task.

Method: We propose a computational method for the prediction of endolysins and autolysins. First, a data set containing 27 endolysins and 41 autolysins is built. Each protein is then represented by its tripeptide composition, and the features with larger confidence degree are selected. Finally, a support vector machine classifier is trained on the labeled vectors and used to predict the type of a cell lytic enzyme.

Results: Following the proposed method, the experimental results show that the overall accuracy reaches 97.06% when 44 features are selected. Compared with Ding's method, our method improves the overall accuracy by nearly 4.5% ((97.06-92.9)/92.9). The performance of the proposed method is stable when the number of selected features ranges from 40 to 70. The overall accuracy with the tripeptide optimal feature set is 94.12%, whereas the overall accuracy of Chou's amphiphilic PseAAC method is 76.2%; the experimental results thus demonstrate that the overall accuracy is improved by nearly 18 percentage points when using the tripeptide optimal feature set.

Conclusion: This paper proposes an efficient method for identifying endolysins and autolysins, using a support vector machine to predict the type of a cell lytic enzyme. The experimental results show that the overall accuracy of the proposed method is 94.12%, which is better than some existing methods. The selected 44 features further improve the overall accuracy for identifying the type of cell lytic enzyme, and the support vector machine performs better than other classifiers when using the selected feature set on the benchmark data set.
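The sketch below illustrates the tripeptide-composition representation and SVM classification described in the method; the confidence-degree feature selection step is not reproduced, and the sequences and labels are hypothetical toy examples.

```python
# Sketch of tripeptide-composition features for protein sequences, classified
# with an SVM; the confidence-degree feature selection step is not reproduced.
from itertools import product
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]  # 8000 features

def tripeptide_composition(seq: str) -> list:
    counts = {t: 0 for t in TRIPEPTIDES}
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
    total = max(len(seq) - 2, 1)
    return [counts[t] / total for t in TRIPEPTIDES]      # normalized frequencies

# Hypothetical toy sequences: label 1 = endolysin, 0 = autolysin.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
             "MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGR"]
labels = [1, 0]
X = [tripeptide_composition(s) for s in sequences]
clf = SVC(kernel="rbf").fit(X, labels)
```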


2019, Vol 21 (9), pp. 662-669
Author(s): Junnan Zhao, Lu Zhu, Weineng Zhou, Lingfeng Yin, Yuchen Wang, ...

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors.

Method: This study was carried out to predict the Ki values of thrombin inhibitors from a large data set using machine learning methods. Because machine learning can find non-intuitive regularities in high-dimensional data sets, it can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor selection method was used to find the appropriate descriptors. Four different methods, namely multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors.

Results: The SVM model was the best among these methods, with R²=0.84 and MSE=0.55 for the training set, and R²=0.83 and MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
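A minimal sketch of the SVM regression and evaluation step described above, assuming scikit-learn and replacing the molecular descriptors and Ki values with random placeholder data:

```python
# Sketch of the SVM regression step with R2/MSE evaluation on a train/test
# split; molecular descriptors and Ki values are random placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # placeholder for selected descriptors
y = rng.normal(size=200)                # placeholder for Ki-derived values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)
    print(name, "R2=%.2f" % r2_score(ys, pred), "MSE=%.2f" % mean_squared_error(ys, pred))
```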


2019, Vol 15 (4), pp. 328-340
Author(s): Apilak Worachartcheewan, Napat Songtawee, Suphakit Siriwong, Supaluk Prachayasittikul, Chanin Nantasenamat, ...

Background: Human immunodeficiency virus (HIV) is the infective agent that causes acquired immunodeficiency syndrome (AIDS). Therefore, the rational design of inhibitors to prevent the progression of the disease is required.

Objective: This study aims to construct quantitative structure-activity relationship (QSAR) models, perform molecular docking, and rationally design new colchicine derivatives with anti-HIV activity.

Methods: A data set of 24 compounds (colchicine and its derivatives) with anti-HIV activity was employed to develop QSAR models using machine learning methods (e.g., multiple linear regression (MLR), artificial neural network (ANN) and support vector machine (SVM)) and to study molecular docking.

Results: The significant descriptors relating to the anti-HIV activity were the JGI2, Mor24u, Gm and R8p+ descriptors. The predictive performance of the models showed acceptable statistical quality, as assessed by the correlation coefficient (Q²) and root mean square error (RMSE) of the leave-one-out cross-validation (LOO-CV) and external sets. In particular, the ANN method outperformed the MLR and SVM methods, with Q²(LOO-CV) of 0.7548 and RMSE(LOO-CV) of 0.5735 for the LOO-CV set, and Q²(Ext) of 0.8553 and RMSE(Ext) of 0.6999 for the external validation set. In addition, molecular docking of the virus-entry molecule (gp120 envelope glycoprotein) revealed the key interacting residues of the protein (cellular receptor, CD4) and the site-moiety preferences of colchicine derivatives as HIV entry inhibitors binding to the HIV structure. Furthermore, a rational design of new colchicine derivatives based on the informative QSAR and molecular docking results was proposed.

Conclusion: These findings serve as a guideline for rational drug design as well as the potential development of novel anti-HIV agents.
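The leave-one-out statistics reported above are typically computed as Q² = 1 − PRESS/TSS over the LOO predictions; the sketch below shows this calculation with placeholder descriptor values and activities (an SVR model stands in for any of the compared learners).

```python
# Sketch of leave-one-out statistics: Q2 = 1 - PRESS/TSS over LOO predictions,
# plus RMSE; the descriptor matrix and activities are placeholders.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(24, 4))            # placeholders for JGI2, Mor24u, Gm, R8p+
y = rng.normal(size=24)                 # placeholder anti-HIV activities

y_loo = cross_val_predict(SVR(), X, y, cv=LeaveOneOut())
press = np.sum((y - y_loo) ** 2)        # predictive residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
q2_loo = 1 - press / tss
rmse_loo = np.sqrt(press / len(y))
print("Q2_LOO=%.3f  RMSE_LOO=%.3f" % (q2_loo, rmse_loo))
```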


2020, Vol 16 (8), pp. 1088-1105
Author(s): Nafiseh Vahedi, Majid Mohammadhosseini, Mehdi Nekoei

Background: The poly(ADP-ribose) polymerases (PARPs) are a nuclear enzyme superfamily present in eukaryotes.

Methods: In the present report, several efficient linear and non-linear methods, including multiple linear regression (MLR), support vector machine (SVM) and artificial neural networks (ANN), were successfully used to develop and establish quantitative structure-activity relationship (QSAR) models capable of predicting the pEC50 values of tetrahydropyridopyridazinone derivatives as effective PARP inhibitors. Principal component analysis (PCA) was used for a rational division of the whole data set into training and test sets. A genetic algorithm (GA) variable selection method was employed to select, from the large pool of calculated descriptors, the optimal subset of descriptors with the most significant contributions to the overall inhibitory activity.

Results: The accuracy and predictability of the proposed models were further confirmed using cross-validation, validation through an external test set, and Y-randomization (chance correlation) approaches. Moreover, an exhaustive statistical comparison was performed on the outputs of the proposed models. The results revealed that the non-linear modeling approaches, SVM and ANN, provided much better predictive capability.

Conclusion: Among the constructed models, and in terms of the root mean square error of prediction (RMSEP), the cross-validation coefficients (Q²(LOO) and Q²(LGO)), as well as the R² and F-statistic values for the training set, the predictive power of the GA-SVM approach was better. However, compared with MLR and SVM, the statistical parameters for the test set were better for the GA-ANN model.
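A small sketch of the Y-randomization (chance correlation) check mentioned above, with placeholder data: the model is refitted on repeatedly shuffled activities, and its cross-validated score should be clearly worse than the score obtained on the true activities.

```python
# Sketch of a Y-randomization (chance correlation) check: a model refitted on
# shuffled activities should score much worse than on the true activities.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))                        # placeholder GA-selected descriptors
y = X[:, 0] * 2.0 + rng.normal(scale=0.3, size=60)   # placeholder pEC50 values

true_q2 = cross_val_score(SVR(), X, y, cv=5, scoring="r2").mean()
scrambled_q2 = [cross_val_score(SVR(), X, rng.permutation(y), cv=5, scoring="r2").mean()
                for _ in range(10)]
print("true Q2=%.3f  max scrambled Q2=%.3f" % (true_q2, max(scrambled_q2)))
```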


2021, Vol 8 (1)
Author(s): Yahya Albalawi, Jim Buckley, Nikola S. Nikolov

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
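As a minimal illustration of how a single pre-processing step can be compared for the traditional classifiers, the sketch below contrasts raw text with one assumed normalization (collapsing Arabic alef variants) in a TF-IDF plus Multinomial NB pipeline; the tweets, labels, and choice of normalization are illustrative placeholders, not necessarily one of the paper's 26 evaluated pre-processings.

```python
# Sketch of comparing one text pre-processing step (here: normalizing Arabic
# alef variants, as an assumed example) for a traditional tweet classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def normalize_alef(text: str) -> str:
    for variant in ("أ", "إ", "آ"):
        text = text.replace(variant, "ا")
    return text

tweets = ["نصائح صحية عن مرض السكري", "أفضل مطاعم المدينة",
          "أعراض الإنفلونزا وعلاجها", "نتائج مباراة اليوم"] * 10
labels = [1, 0, 1, 0] * 10              # 1 = health-related, 0 = other

for name, preprocessor in [("raw", None), ("alef-normalized", normalize_alef)]:
    clf = make_pipeline(TfidfVectorizer(preprocessor=preprocessor), MultinomialNB())
    score = cross_val_score(clf, tweets, labels, cv=5, scoring="f1").mean()
    print(name, "F1=%.3f" % score)
```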


2020, Vol 44 (8), pp. 851-860
Author(s): Joy Eliaerts, Natalie Meert, Pierre Dardenne, Vincent Baeten, Juan-Antonio Fernandez Pierna, ...

Spectroscopic techniques combined with chemometrics are a promising tool for the analysis of seized drug powders. In this study, the performance of three spectroscopic techniques [mid-infrared (MIR), Raman and near-infrared (NIR)] was compared. In total, 364 seized powders were analyzed, consisting of 276 cocaine powders (with concentrations ranging from 4 to 99 w%) and 88 powders without cocaine. A classification model (using Support Vector Machine [SVM] discriminant analysis) and a quantification model (using SVM regression) were constructed with each spectral data set in order to discriminate cocaine powders from other powders and to quantify cocaine in powders classified as cocaine positive. The performance of the models was compared with gas chromatography coupled with mass spectrometry (GC–MS) and gas chromatography with flame-ionization detection (GC–FID). Different evaluation criteria were used: the number of false negatives (FNs), the number of false positives (FPs), accuracy, the root mean square error of cross-validation (RMSECV) and the determination coefficient (R²). Ten colored powders were excluded from the classification data set due to the fluorescence background observed in their Raman spectra. For the classification, the best accuracy (99.7%) was obtained with MIR spectra; with Raman and NIR spectra, the accuracy was 99.5% and 98.9%, respectively. For the quantification, the best results were obtained with NIR spectra: the cocaine content was determined with an RMSECV of 3.79% and an R² of 0.97. The performance of MIR and Raman in predicting cocaine concentrations was lower than that of NIR, with RMSECVs of 6.76% and 6.79%, respectively, both with an R² of 0.90. The three spectroscopic techniques can thus be applied for both classification and quantification of cocaine, but some differences in performance were detected. The best classification was obtained with MIR spectra; for quantification, however, the RMSECV of MIR and Raman was twice as high as that of NIR. Spectroscopic techniques combined with chemometrics can reduce the workload for confirmation analysis (e.g., chromatography based) and therefore save time and resources.
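The two-stage chemometric approach (SVM classification to flag cocaine-positive powders, then SVM regression on the positives, with RMSECV computed from cross-validated predictions) can be sketched as follows with placeholder spectra; the kernel choices and data are assumptions, not the study's settings.

```python
# Sketch of the two-stage chemometric approach: SVM classification to flag
# cocaine-positive spectra, then SVM regression on the positives to estimate
# concentration (RMSECV from cross-validated predictions). Spectra are fake.
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
spectra = rng.normal(size=(120, 500))          # placeholder absorbance spectra
is_cocaine = rng.integers(0, 2, size=120)      # 1 = cocaine powder, 0 = other
concentration = np.where(is_cocaine == 1, rng.uniform(4, 99, size=120), 0.0)

# Stage 1: discriminate cocaine powders from other powders.
cls_pred = cross_val_predict(SVC(kernel="linear"), spectra, is_cocaine, cv=5)
accuracy = (cls_pred == is_cocaine).mean()

# Stage 2: quantify cocaine only in powders classified as positive.
pos = is_cocaine == 1
reg_pred = cross_val_predict(SVR(kernel="linear"), spectra[pos], concentration[pos], cv=5)
rmsecv = np.sqrt(np.mean((reg_pred - concentration[pos]) ** 2))
print("accuracy=%.3f  RMSECV=%.2f w%%" % (accuracy, rmsecv))
```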

