Exploring Data Science Jobs with Web Scraping and Text Mining

Exploring the Potentialities of Automatic Extraction of University Webometric Information

Journal of Data and Information Science ◽

10.2478/jdis-2020-0040 ◽

2020 ◽

Vol 5 (4) ◽

pp. 43-55

Author(s):

Gianpiero Bianchi ◽

Renato Bruni ◽

Cinzia Daraio ◽

Antonio Laureti Palma ◽

Giulio Perani ◽

...

Keyword(s):

Text Mining ◽

Web Mining ◽

Knowledge Extraction ◽

Automatic Extraction ◽

Mining Operations ◽

Automatic Data ◽

Link Type ◽

Web Scraping ◽

University Systems ◽

The Web

Abstract

Purpose: The main objective of this work is to show the potentialities of recently developed approaches for automatic knowledge extraction directly from the universities’ websites. The information automatically extracted can be potentially updated with a frequency higher than once per year, and be safe from manipulations or misinterpretations. Moreover, this approach allows us flexibility in collecting indicators about the efficiency of universities’ websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions to allow new insights for “profiling” the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information to compute our indicators has been extracted from the universities’ websites by using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semi-structured form so that it can be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of Web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the Web has been combined with the University structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at European level. All the above was used to perform a clusterization of 79 Italian universities based on structural and digital indicators.

Findings: The main findings of this study concern the evaluation of the potential in digitalization of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of quality and impact of universities’ websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features using clustering techniques working with the above indicators.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study and its illustration on Italian universities show the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
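
As a rough illustration of the idea in the abstract above (scraped pages kept in a semi-structured form and queried by text mining to build indicators), the following is a minimal sketch. It is not the authors' pipeline: the sample records, the field names, and the `keyword_indicator` function are all hypothetical, and a plain Python list stands in for the NoSQL store.

```python
import json
import re
from collections import defaultdict

# Hypothetical scraped pages, stored as semi-structured documents
# (the kind of record a document-oriented NoSQL store could hold).
PAGES = [
    {"university": "Univ A", "url": "https://a.example/research",
     "text": "Research projects and open data on sustainability."},
    {"university": "Univ A", "url": "https://a.example/courses",
     "text": "Courses for international students."},
    {"university": "Univ B", "url": "https://b.example/home",
     "text": "Sustainability report; research on sustainability."},
]


def keyword_indicator(pages, keyword):
    """Count whole-word, case-insensitive keyword occurrences per university."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)
    for page in pages:
        counts[page["university"]] += len(pattern.findall(page["text"]))
    return dict(counts)


if __name__ == "__main__":
    # Print the per-university indicator as JSON.
    print(json.dumps(keyword_indicator(PAGES, "sustainability"), indent=2))
```

Because each record is a free-form document, a new indicator only needs a new query function over the stored text, which is one reading of the flexibility the abstract describes.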

Automated Data Collection with R - A Practical Guide to Web Scraping and Text Mining

Journal of Statistical Software ◽

10.18637/jss.v068.b03 ◽

2015 ◽

Vol 68 (Book Review 3) ◽

Cited By ~ 3

Author(s):

Stefano M. Iacus

Keyword(s):

Text Mining ◽

Data Collection ◽

Automated Data Collection ◽

Practical Guide ◽

Web Scraping

Modern Clinical Text Mining: A Guide and Review

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-030421-030931 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Bethany Percha

Keyword(s):

Machine Learning ◽

Text Mining ◽

Data Science ◽

Annual Review ◽

Publication Date ◽

Biomedical Data ◽

Clinical Text ◽

Quality Improvement Research ◽

Comprehensive Survey ◽

Technical Advances

Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g., physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, this review describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation in health systems and in industry. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 4 is July 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Big Data Techniques for Supporting Official Statistics

Web Services ◽

10.4018/978-1-5225-7501-6.ch040 ◽

2019 ◽

pp. 728-744 ◽

Cited By ~ 1

Author(s):

Antonino Virgillito ◽

Federico Polidoro

Keyword(s):

Big Data ◽

Data Collection ◽

Data Science ◽

Official Statistics ◽

The Core ◽

Web Scraping ◽

Collection Process ◽

Data Collection Process ◽

Data Source ◽

Use Of Internet

Following the advent of Big Data, statistical offices have been widely exploring the use of the Internet as a data source for modernizing their data collection process. In particular, prices are collected online in several statistical institutes through a technique known as web scraping. The objective of the chapter is to discuss the challenges of web scraping for setting up a continuous data collection process, exploring and classifying the most widespread techniques and presenting how they are used in practical cases. The main technical notions behind web scraping are presented and explained so that readers with no background in IT also have sufficient elements to fully comprehend scraping techniques, promoting the building of mixed skills that is at the core of the spirit of modern data science. Challenges for official statistics deriving from the use of web scraping are briefly sketched. Finally, research ideas for overcoming the limitations of current techniques are presented and discussed.
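
To make the price-collection use case above concrete, here is a minimal sketch of scraping prices out of an HTML fragment with Python's standard `html.parser` module. The markup, the `price` class name, and the helper names are assumptions made for illustration. The chapter does not specify any of them, and a production collector would also need fetching, scheduling, and error handling.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect, as floats, the text of elements whose class is exactly 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        # Assumes price elements contain no nested tags.
        self._in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            # Drop a leading currency symbol before converting.
            self.prices.append(float(text.lstrip("€$")))


# Hypothetical product listing, standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Milk 1L</span> <span class="price">€1.29</span></li>
  <li><span class="name">Bread</span> <span class="price">€2.50</span></li>
</ul>
"""


def extract_prices(html):
    """Return all prices found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


if __name__ == "__main__":
    print(extract_prices(SAMPLE_HTML))
```

A scraper like this is tied to the page layout, so a site redesign can silently break it, which is one of the continuity challenges the chapter raises.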

ADVANCING AN INTERDISCIPLINARY SCIENCE OF AGING THROUGH A PRACTICE-BASED DATA SCIENCE APPROACH

Innovation in Aging ◽

10.1093/geroni/igz038.1786 ◽

2019 ◽

Vol 3 (Supplement_1) ◽

pp. S480-S480

Author(s):

Robert Lucero ◽

Ragnhildur Bjarnadottir

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Older Adults ◽

Text Mining ◽

Language Processing ◽

Fall Risk ◽

Data Science ◽

Care Quality ◽

Science Approach ◽

Hospitalized Older Adults

Two hundred and fifty thousand older adults die annually in United States hospitals because of iatrogenic conditions (ICs). Clinicians, aging experts, patient advocates and federal policy makers agree that there is a need to enhance the safety of hospitalized older adults through improved identification and prevention of ICs. To this end, we are building a research program with the goal of enhancing the safety of hospitalized older adults by reducing ICs through an effective learning health system. Leveraging unique electronic data and healthcare system and human resources at the University of Florida, we are applying a state-of-the-art practice-based data science approach to identify risk factors of ICs (e.g., falls) from structured (i.e., nursing, clinical, administrative) and unstructured or text (i.e., registered nurse’s progress notes) data. Our interdisciplinary academic-clinical partnership includes scientific and clinical experts in patient safety, care quality, health outcomes, nursing and health informatics, natural language processing, data science, aging, standardized terminology, clinical decision support, statistics, machine learning, and hospital operations. Results to date have uncovered previously unknown fall risk factors within nursing (i.e., physical therapy initiation), clinical (i.e., number of fall risk increasing drugs, hemoglobin level), and administrative (i.e., Charlson Comorbidity Index, nurse skill mix, and registered nurse staffing ratio) structured data as well as patient cognitive, environmental, workflow, and communication factors in text data. The application of data science methods (i.e., machine learning and text-mining) and findings from this research will be used to develop text-mining pipelines to support sustained data-driven interdisciplinary aging studies to reduce ICs.

Medical informatics labor market analysis using web crawling, web scraping, and text mining

International Journal of Medical Informatics ◽

10.1016/j.ijmedinf.2021.104453 ◽

2021 ◽

pp. 104453

Author(s):

Jürgen Schedlbauer ◽

Georgios Raptis ◽

Bernd Ludwig

Keyword(s):

Labor Market ◽

Text Mining ◽

Medical Informatics ◽

Market Analysis ◽

Web Crawling ◽

Web Scraping

SIMON MUNZERT, CHRISTIAN RUBBA, PETER MEISSNER, and DOMINIC NYHUIS. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Hoboken: Wiley.

Biometrics ◽

10.1111/biom.12830 ◽

2017 ◽

Vol 73 (4) ◽

pp. 1469-1469

Author(s):

Katharina Selig

Keyword(s):

Text Mining ◽

Data Collection ◽

Automated Data Collection ◽

Practical Guide ◽

Web Scraping

WEB SCRAPING AND DATA SCIENCE IN APPLIED RESEARCH IN COMMUNICATION: a study on online reviews

Revista Observatório ◽

10.20873/uft.2447-4266.2021v7n3a1en ◽

2021 ◽

Vol 7 (3) ◽

pp. a1en

Author(s):

Marcello Tenorio de Farias ◽

Alan César Belo Angeluci ◽

Brasilina Passarelli

Keyword(s):

Data Science ◽

Applied Research ◽

Online Reviews ◽

Google Maps ◽

Automatic Data ◽

Applied Study ◽

Prototype Tool ◽

Web Scraping ◽

Manual Methods ◽

The Web

With the spread of access to and use of information through the web and social networks, retrieving information from large volumes of data by manual methods has become unfeasible. This applied study reports the development and use of a prototype tool, Discovery Stars, for automatically scraping online reviews posted on Google Maps. The retrieved data allowed us to investigate how these reviews can influence the behavior of the platform's users. Among the results, it was observed that reading and posting reviews affect the opinion formation and motivations of Google Maps users.

What is Data Science? An Operational Definition based on Text Mining of Data Science Curricula

Journal of Behavioral Data Science ◽

10.35566/jbds/v1n1/p1 ◽

2021 ◽

Vol 1 (1) ◽

Author(s):

Zhiyong Zhang ◽

Keyword(s):

United States ◽

Text Mining ◽

Computer Science ◽

Topic Modeling ◽

Data Science ◽

The United States ◽

Operational Definition ◽

Bottom Up ◽

Science Curricula

Data science has maintained its popularity for about 20 years. This study adopts a bottom-up approach to understand what data science is by analyzing the descriptions of courses offered by the data science programs in the United States. Through topic modeling, 14 topics are identified from the current curricula of 56 data science programs. These topics reiterate that data science is at the intersection of statistics, computer science, and substantive fields.
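
Topic models of the kind mentioned above usually take a document-term count matrix as input. The sketch below shows only that preprocessing step, not the topic model itself and not the study's actual procedure. The course descriptions, the stopword list, and the function names are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with"}

# Hypothetical course descriptions standing in for scraped curricula.
COURSES = [
    "Introduction to statistics and probability",
    "Statistics for machine learning",
    "Machine learning with Python",
]


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def document_term_matrix(docs):
    """Return (sorted vocabulary, per-document term-count rows)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


if __name__ == "__main__":
    vocab, matrix = document_term_matrix(COURSES)
    print(vocab)
    for row in matrix:
        print(row)
```

A topic modeling library would then factor this matrix into topics; the number of topics (14 in the study) is a modeling choice made at that later stage.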

Web Scraping and Text Mining of Ukrainian News Articles About Ecology

10.1109/acit52158.2021.9548450 ◽

2021 ◽

Author(s):

Vladyslav Holubiev ◽

Volodymyr Simishko

Keyword(s):

Text Mining ◽

Web Scraping
