Trends in web data extraction using machine learning

2021 ◽  
pp. 1-22
Author(s):  
Sudhir Kumar Patnaik ◽  
C. Narendra Babu

Web data extraction has developed significantly in the last decade, building on its inception in the early nineties. It has evolved from simple manual extraction of data from web pages and documents, to automated extraction, to intelligent extraction using machine learning algorithms, tools and techniques. Data extraction is one of the key components of the end-to-end life cycle of the web data extraction process, which includes navigation, extraction, data enrichment and visualization. This paper presents the journey of web data extraction over the years, highlighting the evolution of tools, techniques, frameworks and algorithms for building intelligent web data extraction systems. The paper also sheds light on challenges, opportunities for future research and emerging trends in web data extraction, with a specific focus on machine learning techniques. Both traditional and machine learning approaches to manual and automated web data extraction are evaluated experimentally, and results are reported for a few use cases that demonstrate the challenges that arise when a website's layout changes. The paper introduces novel ideas, such as self-healing capability in web data extraction and proactive error detection when a website's layout changes, as areas for future research. This perspective will help readers gain deeper insight into the present and future of web data extraction.

2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Fathima Aliyar Vellameeran ◽  
Thomas Brindha

Objectives: To present a clear literature review of state-of-the-art heart disease prediction models. Methods: The review covers 61 research papers and reports the significant findings. The analysis first addresses the contribution of each work and notes its simulation environment. It then records the types of machine learning algorithms deployed in each contribution, as well as the datasets used by existing heart disease prediction models. Results: The performance measures reported across the papers, such as prediction accuracy, prediction error, specificity, sensitivity and F-measure, are examined. The best reported performance is also checked to confirm the effectiveness of the contributions. Conclusions: The research challenges and gaps are outlined with respect to the development of intelligent methods that address unresolved challenges in heart disease prediction using data mining techniques.
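The performance measures named in this review (accuracy, error, specificity, sensitivity, F-measure) all derive from the four counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions; the function name and the example counts are illustrative assumptions and are not taken from any of the reviewed papers.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification measures from confusion-matrix counts.

    tp/fp/tn/fn are the true-positive, false-positive, true-negative and
    false-negative counts. Ratios with an empty denominator are returned as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    sensitivity = ratio(tp, tp + fn)   # recall, true-positive rate
    specificity = ratio(tn, tn + fp)   # true-negative rate
    precision = ratio(tp, tp + fp)
    # F-measure (F1): harmonic mean of precision and sensitivity
    f_measure = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for a heart-disease classifier evaluated on 100 patients
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can look high for a model that rarely predicts the minority class.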


2012 ◽  
pp. 13-22 ◽  
Author(s):  
João Gama ◽  
André C.P.L.F. de Carvalho

Machine learning techniques have been successfully applied to several real-world problems in areas as diverse as image analysis, Semantic Web, bioinformatics, text processing, natural language processing, telecommunications, finance, medical diagnosis, and so forth. A particular application where machine learning plays a key role is data mining, where machine learning techniques have been extensively used for the extraction of association, clustering, prediction, diagnosis, and regression models. This text presents our personal view of the main aspects, major tasks, frequently used algorithms, current research, and future directions of machine learning research. It is organized as follows: Background information concerning machine learning is presented in the second section. The third section discusses different definitions of machine learning. Common tasks faced by machine learning systems are described in the fourth section. Popular machine learning algorithms and the importance of the loss function are discussed in the fifth section. The sixth and seventh sections present the current trends and future research directions, respectively.


Author(s):  
Joy Iong-Zong Chen ◽  
Kong-Long Lai

The design of an analogue IC layout is a time-consuming, manual process. Despite several studies in the sector, geometric restrictions have hampered automated analogue IC layout design. As a result, automated analogue design lags behind manual design in performance, which prevents the deployment of a large range of automated tools. With recent technical developments, this challenge can be addressed using machine learning techniques. This study investigates performance-driven placement in the VLSI IC design process, as well as analogue IC performance prediction using various machine learning approaches. Further, several amplifier designs are simulated. The simulation results show that the proposed approach achieves improved performance compared to the manual layout.


2021 ◽  
Vol 10 (2) ◽  
pp. 62
Author(s):  
Vitória Albuquerque ◽  
Miguel Sales Dias ◽  
Fernando Bacao

Cities are moving towards new mobility strategies to tackle smart cities’ challenges such as carbon emission reduction, urban transport multimodality and mitigation of pandemic hazards, emphasising the implementation of shared modes such as bike-sharing systems. This paper poses a research question and introduces a corresponding systematic literature review, focusing on the contributions of machine learning techniques applied to bike-sharing systems to improve cities’ mobility. The preferred reporting items for systematic reviews and meta-analyses (PRISMA) method was adopted to identify specific factors that influence bike-sharing systems, resulting in an analysis of 35 papers published between 2015 and 2019 and an outline for future research. By means of systematic literature review and bibliometric analysis, the machine learning algorithms were identified in two groups: classification and prediction.


2018 ◽  
Vol 7 (S1) ◽  
pp. 82-86
Author(s):  
V. Sudha ◽  
S. Mohan ◽  
S. Arivalagan

Agriculture is the backbone of the Indian economy. Big data is emerging as a precise and viable analytical tool in the agricultural research field. This review paper helps farmers select the right crops and an appropriate cropping pattern for a particular locality. Modern trends in the agriculture domain have made people realize the importance of big data, which provides a methodology for facing challenges in agricultural production through the application of machine learning techniques. A survey of different machine learning techniques is presented in this paper, with the aim of realizing enhanced monetary benefits in a particular area. A study of machine learning algorithms for big data analytics is also presented.


Author(s):  
R. Kanthavel et al.

Osteoarthritis is a common form of arthritis that occurs when cartilage, the elastic tissue that cushions the ends of the bones, breaks down. A person with osteoarthritis may experience joint pain, stiffness, or inflammation. There is no single test for osteoarthritis; physicians combine the medical and clinical record with X-ray imaging analysis to make a diagnosis. Osteoarthritis is generally detected only after pain and bone damage have occurred, whereas earlier diagnosis could permit timely intervention to prevent cartilage deterioration and bone injury. With machine learning algorithms, a system can be trained to automatically distinguish people who will develop osteoarthritis from those who will not by detecting subtle biochemical differences in the centre of the knee’s cartilage. The output of the machine learning techniques identifies persons who are pre-symptomatic at the time of baseline imaging, together with the reduction in liquid concentration. In this study, we present an analysis of various deep learning techniques for the timely detection of osteoarthritis. Several deep learning techniques, a subset of machine learning, have been used for this purpose, and a comparative analysis is therefore needed to choose the best in terms of accuracy and reliability.


2021 ◽  
Author(s):  
Tlamelo Emmanuel ◽  
Thabiso Maupong ◽  
Dimane Mpoeleng ◽  
Thabo Semong ◽  
Mphago Banyatsang ◽  
...  

Machine learning has been a cornerstone of analysing and extracting information from data, and the problem of missing values is often encountered. Missing values arise through different mechanisms: missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. It is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. The literature contains several proposals for handling missing values. In this paper we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations and the kind of data they are most suitable for. Finally, we experiment with the K nearest neighbor and random forest imputation techniques on novel power plant induced fan data and offer some possible future research directions.
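To make the idea of K nearest neighbor imputation concrete, the sketch below fills each missing entry with the mean of that feature over the k most similar rows. It is a simplified illustration under stated assumptions, not the procedure used in the paper: the function name `knn_impute`, the distance definition (root-mean-square difference over features observed in both rows) and the toy data are all hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance between two rows is the root-mean-square difference over the
    columns that are observed (not None) in both rows. Rows sharing no
    observed column are treated as infinitely far apart.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that has column j observed
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy sensor readings (hypothetical); the second row is missing its middle value
readings = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [5.0, 6.0, 7.0],
    [0.9, 2.2, 2.9],
]
filled = knn_impute(readings, k=2)
```

Here the two rows closest to the incomplete one are the first and the last, so the gap is filled with the mean of 2.0 and 2.2. The distant third row does not contribute, which is the property that distinguishes this approach from simple column-mean imputation.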


2021 ◽  
Author(s):  
Martin Seeliger ◽  
Marina Altmeyer ◽  
Andreas Ginau ◽  
Robert Schiestl ◽  
Jürgen Wunderlich

This paper presents the application of machine-learning techniques on pXRF data to establish a chronology for sediment cores around Tell Buto (Tell el-Fara'in) in the northwestern Nile Delta. As modern laboratories for dating techniques like OSL or ¹⁴C are rare in Egypt and sample export is restricted, we are facing a lack of opportunities to create a robust chronology, which is indispensable in modern geoarchaeology. Therefore, we present a new approach to transfer archaeological age information gained at the excavation at Buto to corings of the wider Buto area. Sediments of archaeological outcrops and pits with known age are measured using pXRF to create a geochemical “fingerprint” for several historic eras. Afterwards, these “fingerprints” are transferred to corings of the surrounding areas using machine-learning algorithms. This paper presents 1) the application of three different machine-learning approaches (Neuronal Net, Random Forest, and C5.0 decision tree) to check if archaeological age information can be transferred to sediments far off the settlement mounds using pXRF data, 2) the comparison of all approaches and the evaluation of whether the easily anticipated decision tree and Random Forest show similar results to the “black-box system” Neuronal Net, and finally, 3) a case study that provides the results of Altmeyer et al. (in review) for Kom el-Gir, a further settlement mound a little north of Buto, with a chronostratigraphic framework based on this approach. Reference: Altmeyer, M., Seeliger, M., Ginau, A., Schiestl, R. & J. Wunderlich (in review): Reconstruction of former channel systems in the northwestern Nile Delta (Egypt) based on corings and electrical resistivity tomography (ERT). (Submitted to E & G Quaternary Science Journal).


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Yijun Zhao ◽  
Tong Wang ◽  
Riley Bove ◽  
Bruce Cree ◽  
...  

The rate of disability accumulation varies across multiple sclerosis (MS) patients. Machine learning techniques may offer more powerful means to predict disease course in MS patients. In our study, 724 patients from the Comprehensive Longitudinal Investigation in MS at Brigham and Women’s Hospital (CLIMB study) and 400 patients from the EPIC dataset, University of California, San Francisco, were included in the analysis. The primary outcome was an increase in Expanded Disability Status Scale (EDSS) ≥ 1.5 (worsening) or not (non-worsening) at up to 5 years after the baseline visit. Classification models were built using the CLIMB dataset with patients’ clinical and MRI longitudinal observations in the first 2 years, and further validated using the EPIC dataset. We compared the performance of three popular machine learning algorithms (SVM, Logistic Regression, and Random Forest) and three ensemble learning approaches (XGBoost, LightGBM, and a Meta-learner L). A “threshold” was established to trade off performance between the two classes. Predictive features were identified and compared among different models. Machine learning models achieved 0.79 and 0.83 AUC scores for the CLIMB and EPIC datasets, respectively, shortly after disease onset. Ensemble learning methods were more effective and robust compared to standalone algorithms. Two ensemble models, XGBoost and LightGBM, were superior to the other four models evaluated in our study. Of the variables evaluated, EDSS, Pyramidal Function, and Ambulatory Index were the top common predictors in forecasting the MS disease course. Machine learning techniques, in particular ensemble methods, offer increased accuracy for the prediction of MS disease course.
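The two evaluation ideas mentioned in this abstract, the AUC score and a decision threshold that trades off performance between the two classes, can be illustrated with a short sketch. This is not the authors' code: the function names, scores and labels below are made-up assumptions, with label 1 standing for "worsening" and 0 for "non-worsening".

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case receives a higher score than a randomly chosen
    negative case (ties count as one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def rates_at_threshold(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are
    classified as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model scores and true outcomes for six patients
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

overall_auc = auc(scores, labels)
low_threshold = rates_at_threshold(scores, labels, 0.35)   # favours sensitivity
high_threshold = rates_at_threshold(scores, labels, 0.70)  # favours specificity
```

Lowering the threshold catches every worsening case in this toy data at the cost of one false alarm, while raising it removes the false alarm but misses one worsening case. The AUC is unaffected by the threshold because it summarizes ranking quality across all thresholds.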

