Exploring Data Science Jobs with Web Scraping and Text Mining

2015 ◽  
pp. 481-530
2020 ◽  
Vol 5 (4) ◽  
pp. 43-55
Author(s):  
Gianpiero Bianchi ◽  
Renato Bruni ◽  
Cinzia Daraio ◽  
Antonio Laureti Palma ◽  
Giulio Perani ◽  
...  

AbstractPurposeThe main objective of this work is to show the potentialities of recently developed approaches for automatic knowledge extraction directly from the universities’ websites. The information automatically extracted can be potentially updated with a frequency higher than once per year, and be safe from manipulations or misinterpretations. Moreover, this approach allows us flexibility in collecting indicators about the efficiency of universities’ websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions to allow new insights for “profiling” the analyzed universities.Design/methodology/approachWebometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all the three categories of web mining: web content mining; web structure mining; web usage mining. The information to compute our indicators has been extracted from the universities’ websites by using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB according to a semi-structured form to allow for retrieving information efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of Web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the Web has been combined with the University structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at European level. All the above was used to perform a clusterization of 79 Italian universities based on structural and digital indicators.FindingsThe main findings of this study concern the evaluation of the potential in digitalization of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of quality and impact of universities’ websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features using clustering techniques working with the above indicators.Research limitationsThe results reported in this study refers to Italian universities only, but the approach could be extended to other university systems abroad.Practical implicationsThe approach proposed in this study and its illustration on Italian universities show the usefulness of recently introduced automatic data extraction and web scraping approaches and its practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.Originality/valueThis work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).


Author(s):  
Bethany Percha

Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g., physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, this review describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation in health systems and in industry. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 4 is July 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.


Web Services ◽  
2019 ◽  
pp. 728-744 ◽  
Author(s):  
Antonino Virgillito ◽  
Federico Polidoro

Following the advent of Big Data, statistical offices have been largely exploring the use of Internet as data source for modernizing their data collection process. Particularly, prices are collected online in several statistical institutes through a technique known as web scraping. The objective of the chapter is to discuss the challenges of web scraping for setting up a continuous data collection process, exploring and classifying the more widespread techniques and presenting how they are used in practical cases. The main technical notions behind web scraping are presented and explained in order to give also to readers with no background in IT the sufficient elements to fully comprehend scraping techniques, promoting the building of mixed skills that is at the core of the spirit of modern data science. Challenges for official statistics deriving from the use of web scraping are briefly sketched. Finally, research ideas for overcoming the limitations of current techniques are presented and discussed.


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S480-S480
Author(s):  
Robert Lucero ◽  
Ragnhildur Bjarnadottir

Abstract Two hundred and fifty thousand older adults die annually in United States hospitals because of iatrogenic conditions (ICs). Clinicians, aging experts, patient advocates and federal policy makers agree that there is a need to enhance the safety of hospitalized older adults through improved identification and prevention of ICs. To this end, we are building a research program with the goal of enhancing the safety of hospitalized older adults by reducing ICs through an effective learning health system. Leveraging unique electronic data and healthcare system and human resources at the University of Florida, we are applying a state-of-the-art practice-based data science approach to identify risk factors of ICs (e.g., falls) from structured (i.e., nursing, clinical, administrative) and unstructured or text (i.e., registered nurse’s progress notes) data. Our interdisciplinary academic-clinical partnership includes scientific and clinical experts in patient safety, care quality, health outcomes, nursing and health informatics, natural language processing, data science, aging, standardized terminology, clinical decision support, statistics, machine learning, and hospital operations. Results to date have uncovered previously unknown fall risk factors within nursing (i.e., physical therapy initiation), clinical (i.e., number of fall risk increasing drugs, hemoglobin level), and administrative (i.e., Charlson Comorbidity Index, nurse skill mix, and registered nurse staffing ratio) structured data as well as patient cognitive, environmental, workflow, and communication factors in text data. The application of data science methods (i.e., machine learning and text-mining) and findings from this research will be used to develop text-mining pipelines to support sustained data-driven interdisciplinary aging studies to reduce ICs.


2021 ◽  
Vol 7 (3) ◽  
pp. a1en
Author(s):  
Marcello Tenorio de Farias ◽  
Alan César Belo Angeluci ◽  
Brasilina Passarelli

With the spread of access and use of information through the web and social networks, information retrieval in large volumes of data has become unfeasible by manual methods. In this applied study, the contribution of the development and use of a prototype tool for automatic data scraping from online evaluations made on Google Maps – Discovery Stars – was reported. The retrieved data allowed us to investigate how these assessments can have the potential to influence the behavior of the platform's users. Among the results, it was observed that the reading and posting of reviews impact the formation of opinion and motivations of Google Maps users.  


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Zhiyong Zhang ◽  

Data science has maintained its popularity for about 20 years. This study adopts a bottom-up approach to understand what data science is by analyzing the descriptions of courses offered by the data science programs in the United States. Through topic modeling, 14 topics are identified from the current curricula of 56 data science programs. These topics reiterate that data science is at the intersection of statistics, computer science, and substantive fields.


Sign in / Sign up

Export Citation Format

Share Document