Multi-Purpose Dataset of Webpages and Its Content Blocks: Design and Structure Validation

2021 ◽  
Vol 11 (8) ◽  
pp. 3319
Author(s):  
Kiril Griazev ◽  
Simona Ramanauskaitė

The need for automated data extraction is continuously growing due to the constant addition of information to the worldwide web. Researchers are developing new data extraction methods to achieve better performance than existing ones, and comparing algorithms is vital when developing new solutions. Because of their different data extraction approaches, different algorithms require different datasets to test their performance. Currently, most datasets focus on a specific data extraction approach, so they generally lack data that may be useful for other extraction methods. That leads to difficulties when comparing the performance of algorithms that differ greatly in their approach. To counter this, we propose a dataset of web page content blocks that includes a wide range of data points. We also validate its design and structure by performing block labeling experiments: web developers of varying experience levels labeled multiple websites presented to them, and their labeling results were stored in the newly proposed dataset structure. The experiment confirmed the need for the proposed data points and validated the suitability of the dataset structure for multi-purpose dataset design.
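As a rough illustration of what a single entry in such a dataset might look like, the sketch below defines a content-block record with assumed field names (page URL, XPath, raw HTML, visible text, annotator label, and annotator experience level); the actual schema proposed in the paper may differ.

```python
from dataclasses import dataclass, asdict
import json

# A minimal sketch of one content-block record; the field names below are
# illustrative assumptions, not the dataset's published schema.
@dataclass
class ContentBlock:
    page_url: str              # source web page
    xpath: str                 # location of the block in the DOM
    html: str                  # raw HTML of the block
    text: str                  # visible text content
    label: str                 # label assigned by the annotator
    annotator_experience: str  # experience level of the labeling web developer

block = ContentBlock(
    page_url="https://example.com/news/item-1",
    xpath="/html/body/div[2]/article",
    html="<article><h1>Title</h1><p>Body</p></article>",
    text="Title Body",
    label="main-content",
    annotator_experience="senior",
)

print(json.dumps(asdict(block), indent=2))
```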

2020 ◽  
pp. 5-9
Author(s):  
Manasvi Srivastava ◽  
Vikas Yadav ◽  
Swati Singh ◽  
...  

The Internet is the largest source of information created by humanity. It contains a wide variety of material in formats such as text, audio, and video. Web scraping is one way to gather this information: a set of strategies for obtaining data from websites instead of copying it manually. Many Web-based data extraction methods are designed to solve specific problems and work on ad hoc domains. Various tools and technologies have been developed to facilitate Web scraping, although the appropriateness and ethics of using these tools are often overlooked. Hundreds of web scraping packages are available today, most of them written for Java, Python, and Ruby, and both open source and commercial options exist. Web-based tools such as Yahoo Pipes, Google Web Scrapers, and the Outwit Firefox extension are well suited to beginners. Web extraction replaces this manual extraction and editing process, providing an easier and better way to collect data from a web page, convert it into the desired format, and save it to a local or archive directory. In this paper, among the different kinds of scraping, we focus on techniques that extract the content of a Web page. In particular, we apply scraping techniques to collect information on a variety of diseases, along with their symptoms and precautions.
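As a minimal sketch of this kind of content extraction, the snippet below scrapes a hypothetical disease listing page with requests and BeautifulSoup; the URL, CSS classes, and page layout are illustrative assumptions, not the structure of any particular site used in the paper.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; a real health portal would have its own layout and markup.
URL = "https://example.org/diseases"

def scrape_diseases(url):
    """Collect disease name, symptoms, and precautions from an assumed page layout."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    # Assumption: each disease is an <article class="disease"> with an <h2> name
    # and <ul class="symptoms"> / <ul class="precautions"> lists.
    for entry in soup.select("article.disease"):
        name = entry.select_one("h2")
        symptoms = [li.get_text(strip=True) for li in entry.select("ul.symptoms li")]
        precautions = [li.get_text(strip=True) for li in entry.select("ul.precautions li")]
        records.append({
            "disease": name.get_text(strip=True) if name else "",
            "symptoms": symptoms,
            "precautions": precautions,
        })
    return records

if __name__ == "__main__":
    for record in scrape_diseases(URL):
        print(record)
```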


2018 ◽  
Author(s):  
Jordan Carlson ◽  
J. Aaron Hipp ◽  
Jacqueline Kerr ◽  
Todd Horowitz ◽  
David Berrigan

BACKGROUND Image-based data collection for obesity research is in its infancy. OBJECTIVE The present study aimed to document challenges to and benefits from such research by capturing examples of research involving the use of images to assess physical activity- or nutrition-related behaviors and/or environments. METHODS Researchers (i.e., key informants) using image capture in their research were identified through the knowledge and networks of the authors of this paper and through a literature search. Twenty-nine key informants completed a survey, developed specifically for this study, covering the type of research, source of images, and challenges and benefits experienced. RESULTS Most respondents used still images in their research, with only 26.7% using video. Image sources were categorized as participant generated (N = 13; e.g., participants using smartphones for dietary assessment), researcher generated (N = 10; e.g., wearable cameras with automatic image capture), or curated from third parties (N = 7; e.g., Google Street View). Two of the major challenges that emerged were the need for automated processing of large datasets (58.8%) and participant recruitment/compliance (41.2%). Benefit-related themes included greater perspectives on obesity with increased data coverage (34.6%) and improved accuracy of behavior and environment assessment (34.6%). CONCLUSIONS Technological advances will support the increased use of images in the assessment of physical activity, nutrition behaviors, and environments. To advance this area of research, more effective collaborations are needed between health and computer scientists. In particular, development of automated data extraction methods for diverse aspects of behavior, environment, and food characteristics is needed. Additionally, progress in standards for addressing ethical issues related to image capture for research purposes is critical. CLINICALTRIAL NA


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Irvin Dongo ◽  
Yudith Cardinale ◽  
Ana Aguilera ◽  
Fabiola Martinez ◽  
Yuni Quintero ◽  
...  

Purpose This paper aims to perform an exhaustive review of relevant and recent related studies, which reveals that both extraction methods (the Twitter API and Web scraping) are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies perform a comparative evaluation of both extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both data extraction alternatives and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations. Design/methodology/approach As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the shared information. To do so, several proposals use either the Twitter API or Web scraping to extract the data to perform the analysis. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods. Findings The study demonstrates the differences in terms of accuracy and efficiency of both extraction methods and highlights further problems in this area that must be addressed to pursue true transparency and legitimacy of information on the Web. Originality/value Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e., the tweet). Moreover, concerning time performance, Web scraping is faster than the Twitter API and is more flexible in terms of obtaining data; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.
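The finding that both extraction methods yield identical credibility values after robust text normalization suggests a pipeline along the lines of the sketch below; the specific steps (Unicode normalization, case folding, stripping URLs, mentions, and extra whitespace) are assumptions about typical practice, not the authors' exact procedure.

```python
import re
import unicodedata

# A minimal sketch of tweet text normalization; the concrete steps are assumed,
# not taken from the study's pipeline.
def normalize_tweet(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # unify Unicode forms
    text = text.lower()                          # case-fold
    text = re.sub(r"https?://\S+", "", text)     # drop URLs
    text = re.sub(r"[@#]\w+", "", text)          # drop mentions and hashtags
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

# With identical normalization, text retrieved via the API and via scraping
# should compare equal, so downstream credibility scores match.
api_text = "Breaking: vaccine update https://t.co/abc123 @newsdesk"
scraped_text = "Breaking: Vaccine update https://t.co/abc123 @newsdesk "
assert normalize_tweet(api_text) == normalize_tweet(scraped_text)
```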


2018 ◽  
Vol 41 (1) ◽  
pp. 125-144 ◽  
Author(s):  
Rebecca Campbell ◽  
Rachael Goodman-Williams ◽  
Hannah Feeney ◽  
Giannina Fehler-Cabral

The purpose of this study was to develop triangulation coding methods for a large-scale action research and evaluation project and to examine how practitioners and policy makers interpreted both convergent and divergent data. We created a color-coded system that evaluated the extent of triangulation across methodologies (qualitative and quantitative), data collection methods (observations, interviews, and archival records), and stakeholder groups (five distinct disciplines/organizations). Triangulation was assessed for both specific data points (e.g., a piece of historical/contextual information or qualitative theme) and substantive findings that emanated from further analysis of those data points (e.g., a statistical model or a mechanistic qualitative assertion that links themes). We present five case study examples that explore the complexities of interpreting triangulation data and determining whether data are deemed credible and actionable if not convergent.
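A minimal sketch of how such a color-coded triangulation tally could be automated is shown below; the finding names, source categories, and thresholds are illustrative assumptions rather than the coding rules used in the study.

```python
# A minimal sketch of a color-coded triangulation tally; names and thresholds
# are illustrative assumptions, not the study's actual coding scheme.
findings = {
    "finding_A": {
        "methodologies": {"qualitative", "quantitative"},
        "methods": {"observations", "interviews", "archival records"},
        "stakeholders": {"group_1", "group_2", "group_3"},
    },
    "finding_B": {
        "methodologies": {"qualitative"},
        "methods": {"interviews"},
        "stakeholders": {"group_1"},
    },
}

def triangulation_color(sources: dict) -> str:
    """Assign a rough color code based on how many independent sources converge."""
    breadth = sum(len(v) for v in sources.values())
    if breadth >= 6:
        return "green"   # convergent across methodologies, methods, and groups
    if breadth >= 4:
        return "yellow"  # partial convergence
    return "red"         # little or no triangulation

for name, sources in findings.items():
    print(name, "->", triangulation_color(sources))
```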


2021 ◽  
Author(s):  
Liam Rose ◽  
Linda Diem Tran ◽  
Steven M Asch ◽  
Anita Vashi

Objective: To examine how VA shifted care delivery methods one year into the pandemic. Study Setting: All encounters paid for or provided by VA between January 1, 2019 and February 27, 2021. Study Design: We aggregated all VA-paid or VA-provided encounters and classified them into community (non-VA) acute and non-acute visits, VA acute and non-acute visits, and VA virtual visits. We then compared the number of encounters by week over time to pre-pandemic levels. Data Extraction Methods: Aggregation of administrative VA claims and health records. Principal Findings: VA has experienced a dramatic and persistent shift toward providing virtual care and purchasing care from non-VA providers. Before the pandemic, a majority (63%) of VA care was provided in person at a VA facility. One year into the pandemic, in-person care at VA facilities constituted just 33% of all visits. Most of the difference was made up by a large expansion of virtual care; total VA-provided visits (in-person and virtual) declined (4.9 million to 4.2 million), while total visits of all types declined only 3.5%. Community-provided visits exceeded pre-pandemic levels (2.3 million to 2.9 million, +26%). Conclusion: Unlike private health care, VA has resumed in-person care slowly at its own facilities and more rapidly through purchased care, with different financial incentives a likely driver. The very large expansion of virtual care nearly made up the difference. Given VA's widespread physical presence across the U.S., these shifts have important implications for access to care and the future allocation of medical personnel, facilities, and resources.
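As a rough sketch of the classification and weekly aggregation described above, the snippet below labels a toy encounter table and counts visits per week; the column names and categories are illustrative assumptions, since the underlying VA administrative data are not public.

```python
import pandas as pd

# Toy encounter table with assumed columns; real VA claims data are far richer.
encounters = pd.DataFrame({
    "date": pd.to_datetime(["2019-03-04", "2020-04-06", "2021-02-22", "2021-02-23"]),
    "setting": ["VA", "VA", "community", "VA"],   # who delivered or paid for care
    "modality": ["in-person", "virtual", "in-person", "virtual"],
    "acute": [False, False, True, False],
})

def classify(row) -> str:
    """Map each encounter to one of the categories used in the study design."""
    if row["setting"] == "VA" and row["modality"] == "virtual":
        return "VA virtual"
    acuity = "acute" if row["acute"] else "non-acute"
    return f"{row['setting']} {acuity}"

encounters["category"] = encounters.apply(classify, axis=1)

# Weekly counts per category, for comparison against pre-pandemic baselines.
weekly = (encounters
          .groupby([pd.Grouper(key="date", freq="W"), "category"])
          .size()
          .rename("visits")
          .reset_index())
print(weekly)
```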


2020 ◽  
Vol 117 ◽  
pp. 158-164 ◽  
Author(s):  
Livia Puljak ◽  
Nicoletta Riva ◽  
Elena Parmelli ◽  
Marien González-Lorenzo ◽  
Lorenzo Moja ◽  
...  

2013 ◽  
Vol 718-720 ◽  
pp. 2359-2364
Author(s):  
Zhao Hui Cai

The uptake of semantic technology depends on the availability of useful tools that enable Web developers to generate linked course data automatically. RDF triples allow a web page to contain machine-readable content that is easier to find and to mash up with other content. This paper describes a framework that turns this idea around, using RDF as a template language for generating machine-readable triples from human-readable data on Web pages. Most existing methods generate RDF triples by combining a template with query results from a relational database. In the Linked Course Data Generating framework, the raw course data is first turned into RDF triples, then into linked data, and finally into an ontology. This paper evaluates the performance of the framework.
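As a minimal sketch of the first step, turning raw course data into RDF triples, the snippet below uses rdflib with a toy course record and an illustrative vocabulary; the framework described in the paper defines its own templates and ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Toy course record and assumed vocabulary; the paper's templates differ.
COURSE = Namespace("http://example.org/course/")
SCHEMA = Namespace("http://schema.org/")

raw_course = {"id": "cs101", "title": "Introduction to Programming", "credits": 3}

g = Graph()
subject = URIRef(COURSE[raw_course["id"]])
g.add((subject, RDF.type, SCHEMA.Course))
g.add((subject, SCHEMA.name, Literal(raw_course["title"])))
g.add((subject, SCHEMA.numberOfCredits, Literal(raw_course["credits"])))

# Serialize as Turtle so the triples can be published as linked data.
print(g.serialize(format="turtle"))
```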

