Multi-Purpose Dataset of Webpages and Its Content Blocks: Design and Structure Validation

2021 ◽  
Vol 11 (8) ◽  
pp. 3319
Author(s):  
Kiril Griazev ◽  
Simona Ramanauskaitė

The need for automated data extraction is continuously growing due to the constant addition of information to the worldwide web. Researchers are developing new data extraction methods to achieve better performance than existing ones, and comparing algorithms is vital when developing new solutions. Because of their different data extraction approaches, different algorithms require different datasets to test their performance. Currently, most datasets focus on a specific data extraction approach, so they generally lack data that may be useful for other extraction methods. That leads to difficulties when comparing the performance of algorithms that differ greatly in their approach. To counter this, we propose a dataset of web page content blocks that includes a wide range of data points. We also validate its design and structure by performing block labeling experiments: web developers of varying experience levels labeled multiple websites presented to them, and their labeling results were stored in the newly proposed dataset structure. The experiment confirmed the need for the proposed data points and validated the suitability of the dataset structure for multi-purpose dataset design.
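As a rough illustration of what a single entry in such a dataset might look like, the sketch below defines a content-block record with assumed field names (page URL, XPath, raw HTML, visible text, annotator label, and annotator experience level); the actual schema proposed in the paper may differ.

```python
from dataclasses import dataclass, asdict
import json

# A minimal sketch of one content-block record; the field names below are
# illustrative assumptions, not the dataset's published schema.
@dataclass
class ContentBlock:
    page_url: str              # source web page
    xpath: str                 # location of the block in the DOM
    html: str                  # raw HTML of the block
    text: str                  # visible text content
    label: str                 # label assigned by the annotator
    annotator_experience: str  # experience level of the labeling web developer

block = ContentBlock(
    page_url="https://example.com/news/item-1",
    xpath="/html/body/div[2]/article",
    html="<article><h1>Title</h1><p>Body</p></article>",
    text="Title Body",
    label="main-content",
    annotator_experience="senior",
)

print(json.dumps(asdict(block), indent=2))
```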

2020 ◽  
pp. 5-9
Author(s):  
Manasvi Srivastava ◽  
Vikas Yadav ◽  
Swati Singh ◽  
...  

The Internet is the largest source of information created by humanity. It contains a wide variety of material in formats such as text, audio, and video. Web scraping is one way to gather this information: a set of strategies for obtaining data from websites instead of copying it manually. Many Web-based data extraction methods are designed to solve specific problems and work on ad hoc domains. Various tools and technologies have been developed to facilitate Web scraping, although the appropriateness and ethics of using these tools are often overlooked. Hundreds of web scraping packages are available today, most of them written for Java, Python, and Ruby, and both open source and commercial options exist. Web-based tools such as Yahoo Pipes, Google Web Scrapers, and the Outwit Firefox extension are well suited to beginners. Web extraction replaces this manual extraction and editing process, providing an easier and better way to collect data from a web page, convert it into the desired format, and save it to a local or archive directory. In this paper, among the different kinds of scraping, we focus on techniques that extract the content of a Web page. In particular, we apply scraping techniques to collect information on a variety of diseases, along with their symptoms and precautions.
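As a minimal sketch of this kind of content extraction, the snippet below scrapes a hypothetical disease listing page with requests and BeautifulSoup; the URL, CSS classes, and page layout are illustrative assumptions, not the structure of any particular site used in the paper.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; a real health portal would have its own layout and markup.
URL = "https://example.org/diseases"

def scrape_diseases(url):
    """Collect disease name, symptoms, and precautions from an assumed page layout."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    # Assumption: each disease is an <article class="disease"> with an <h2> name
    # and <ul class="symptoms"> / <ul class="precautions"> lists.
    for entry in soup.select("article.disease"):
        name = entry.select_one("h2")
        symptoms = [li.get_text(strip=True) for li in entry.select("ul.symptoms li")]
        precautions = [li.get_text(strip=True) for li in entry.select("ul.precautions li")]
        records.append({
            "disease": name.get_text(strip=True) if name else "",
            "symptoms": symptoms,
            "precautions": precautions,
        })
    return records

if __name__ == "__main__":
    for record in scrape_diseases(URL):
        print(record)
```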


2018 ◽  
Author(s):  
Jordan Carlson ◽  
J. Aaron Hipp ◽  
Jacqueline Kerr ◽  
Todd Horowitz ◽  
David Berrigan

BACKGROUND Image-based data collection for obesity research is in its infancy. OBJECTIVE The present study aimed to document challenges to and benefits from such research by capturing examples of research involving the use of images to assess physical activity- or nutrition-related behaviors and/or environments. METHODS Researchers (i.e., key informants) using image capture in their research were identified through the knowledge and networks of the authors of this paper and through a literature search. Twenty-nine key informants completed a survey, developed specifically for this study, covering the type of research, source of images, and challenges and benefits experienced. RESULTS Most respondents used still images in their research, with only 26.7% using video. Image sources were categorized as participant generated (N = 13; e.g., participants using smartphones for dietary assessment), researcher generated (N = 10; e.g., wearable cameras with automatic image capture), or curated from third parties (N = 7; e.g., Google Street View). Two of the major challenges that emerged were the need for automated processing of large datasets (58.8%) and participant recruitment/compliance (41.2%). Benefit-related themes included greater perspectives on obesity with increased data coverage (34.6%) and improved accuracy of behavior and environment assessment (34.6%). CONCLUSIONS Technological advances will support the increased use of images in the assessment of physical activity, nutrition behaviors, and environments. To advance this area of research, more effective collaborations are needed between health and computer scientists. In particular, development of automated data extraction methods for diverse aspects of behavior, environment, and food characteristics is needed. Additionally, progress in standards for addressing ethical issues related to image capture for research purposes is critical. CLINICALTRIAL NA


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Irvin Dongo ◽  
Yudith Cardinale ◽  
Ana Aguilera ◽  
Fabiola Martinez ◽  
Yuni Quintero ◽  
...  

Purpose This paper aims to perform an exhaustive review of relevant and recent related studies, which reveals that both extraction methods (the Twitter API and Web scraping) are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies perform a comparative evaluation of both extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both data extraction alternatives and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations. Design/methodology/approach As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the shared information. To do so, several proposals use either the Twitter API or Web scraping to extract the data to perform the analysis. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods. Findings The study demonstrates the differences in terms of accuracy and efficiency of both extraction methods and highlights further problems in this area that must be addressed to pursue true transparency and legitimacy of information on the Web. Originality/value Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e., the tweet). Moreover, concerning time performance, Web scraping is faster than the Twitter API and is more flexible in terms of obtaining data; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.
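The finding that both extraction methods yield identical credibility values after robust text normalization suggests a pipeline along the lines of the sketch below; the specific steps (Unicode normalization, case folding, stripping URLs, mentions, and extra whitespace) are assumptions about typical practice, not the authors' exact procedure.

```python
import re
import unicodedata

# A minimal sketch of tweet text normalization; the concrete steps are assumed,
# not taken from the study's pipeline.
def normalize_tweet(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # unify Unicode forms
    text = text.lower()                          # case-fold
    text = re.sub(r"https?://\S+", "", text)     # drop URLs
    text = re.sub(r"[@#]\w+", "", text)          # drop mentions and hashtags
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

# With identical normalization, text retrieved via the API and via scraping
# should compare equal, so downstream credibility scores match.
api_text = "Breaking: vaccine update https://t.co/abc123 @newsdesk"
scraped_text = "Breaking: Vaccine update https://t.co/abc123 @newsdesk "
assert normalize_tweet(api_text) == normalize_tweet(scraped_text)
```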


2018 ◽  
Vol 41 (1) ◽  
pp. 125-144 ◽  
Author(s):  
Rebecca Campbell ◽  
Rachael Goodman-Williams ◽  
Hannah Feeney ◽  
Giannina Fehler-Cabral

The purpose of this study was to develop triangulation coding methods for a large-scale action research and evaluation project and to examine how practitioners and policy makers interpreted both convergent and divergent data. We created a color-coded system that evaluated the extent of triangulation across methodologies (qualitative and quantitative), data collection methods (observations, interviews, and archival records), and stakeholder groups (five distinct disciplines/organizations). Triangulation was assessed for both specific data points (e.g., a piece of historical/contextual information or qualitative theme) and substantive findings that emanated from further analysis of those data points (e.g., a statistical model or a mechanistic qualitative assertion that links themes). We present five case study examples that explore the complexities of interpreting triangulation data and determining whether data are deemed credible and actionable if not convergent.
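A minimal sketch of how such a color-coded triangulation tally could be automated is shown below; the finding names, source categories, and thresholds are illustrative assumptions rather than the coding rules used in the study.

```python
# A minimal sketch of a color-coded triangulation tally; names and thresholds
# are illustrative assumptions, not the study's actual coding scheme.
findings = {
    "finding_A": {
        "methodologies": {"qualitative", "quantitative"},
        "methods": {"observations", "interviews", "archival records"},
        "stakeholders": {"group_1", "group_2", "group_3"},
    },
    "finding_B": {
        "methodologies": {"qualitative"},
        "methods": {"interviews"},
        "stakeholders": {"group_1"},
    },
}

def triangulation_color(sources: dict) -> str:
    """Assign a rough color code based on how many independent sources converge."""
    breadth = sum(len(v) for v in sources.values())
    if breadth >= 6:
        return "green"   # convergent across methodologies, methods, and groups
    if breadth >= 4:
        return "yellow"  # partial convergence
    return "red"         # little or no triangulation

for name, sources in findings.items():
    print(name, "->", triangulation_color(sources))
```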


2021 ◽  
Author(s):  
Liam Rose ◽  
Linda Diem Tran ◽  
Steven M Asch ◽  
Anita Vashi

Objective: To examine how VA shifted care delivery methods one year into the pandemic. Study Setting: All encounters paid for or provided by VA between January 1, 2019 and February 27, 2021. Study Design: We aggregated all VA-paid or VA-provided encounters and classified them into community (non-VA) acute and non-acute visits, VA acute and non-acute visits, and VA virtual visits. We then compared the number of encounters by week over time to pre-pandemic levels. Data Extraction Methods: Aggregation of administrative VA claims and health records. Principal Findings: VA has experienced a dramatic and persistent shift toward providing virtual care and purchasing care from non-VA providers. Before the pandemic, a majority (63%) of VA care was provided in person at a VA facility. One year into the pandemic, in-person care at VA facilities constituted just 33% of all visits. Most of the difference was made up by a large expansion of virtual care; total VA-provided visits (in-person and virtual) declined (4.9 million to 4.2 million), while total visits of all types declined only 3.5%. Community-provided visits exceeded pre-pandemic levels (2.3 million to 2.9 million, +26%). Conclusion: Unlike private health care, VA has resumed in-person care slowly at its own facilities and more rapidly through purchased care, with different financial incentives a likely driver. The very large expansion of virtual care nearly made up the difference. Given VA's widespread physical presence across the U.S., these shifts have important implications for access to care and the future allocation of medical personnel, facilities, and resources.
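As a rough sketch of the classification and weekly aggregation described above, the snippet below labels a toy encounter table and counts visits per week; the column names and categories are illustrative assumptions, since the underlying VA administrative data are not public.

```python
import pandas as pd

# Toy encounter table with assumed columns; real VA claims data are far richer.
encounters = pd.DataFrame({
    "date": pd.to_datetime(["2019-03-04", "2020-04-06", "2021-02-22", "2021-02-23"]),
    "setting": ["VA", "VA", "community", "VA"],   # who delivered or paid for care
    "modality": ["in-person", "virtual", "in-person", "virtual"],
    "acute": [False, False, True, False],
})

def classify(row) -> str:
    """Map each encounter to one of the categories used in the study design."""
    if row["setting"] == "VA" and row["modality"] == "virtual":
        return "VA virtual"
    acuity = "acute" if row["acute"] else "non-acute"
    return f"{row['setting']} {acuity}"

encounters["category"] = encounters.apply(classify, axis=1)

# Weekly counts per category, for comparison against pre-pandemic baselines.
weekly = (encounters
          .groupby([pd.Grouper(key="date", freq="W"), "category"])
          .size()
          .rename("visits")
          .reset_index())
print(weekly)
```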


2020 ◽  
Vol 117 ◽  
pp. 158-164 ◽  
Author(s):  
Livia Puljak ◽  
Nicoletta Riva ◽  
Elena Parmelli ◽  
Marien González-Lorenzo ◽  
Lorenzo Moja ◽  
...  

2013 ◽  
Vol 718-720 ◽  
pp. 2359-2364
Author(s):  
Zhao Hui Cai

The uptake of semantic technology depends on the availability of useful tools that enable Web developers to generate linked course data automatically. RDF triples allow a web page to contain machine-readable content that is easier to find and to mash up with other content. This paper describes a framework that turns this idea around, using RDF as a template language for generating machine-readable triples from human-readable data on Web pages. Most existing methods generate RDF triples by combining a template with query results from a relational database. In the Linked Course Data Generating framework, the raw course data is first turned into RDF triples, then into linked data, and finally into an ontology. This paper evaluates the performance of the framework.
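As a minimal sketch of the first step, turning raw course data into RDF triples, the snippet below uses rdflib with a toy course record and an illustrative vocabulary; the framework described in the paper defines its own templates and ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Toy course record and assumed vocabulary; the paper's templates differ.
COURSE = Namespace("http://example.org/course/")
SCHEMA = Namespace("http://schema.org/")

raw_course = {"id": "cs101", "title": "Introduction to Programming", "credits": 3}

g = Graph()
subject = URIRef(COURSE[raw_course["id"]])
g.add((subject, RDF.type, SCHEMA.Course))
g.add((subject, SCHEMA.name, Literal(raw_course["title"])))
g.add((subject, SCHEMA.numberOfCredits, Literal(raw_course["credits"])))

# Serialize as Turtle so the triples can be published as linked data.
print(g.serialize(format="turtle"))
```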

