Entity Disambiguation with Web Links

Author(s):  
Andrew Chisholm ◽  
Ben Hachey

Entity disambiguation with Wikipedia relies on structured information from redirect pages, article text, inter-article links, and categories. We explore whether web links can replace a curated encyclopaedia, obtaining entity prior, name, context, and coherence models from a corpus of web pages with links to Wikipedia. Experiments compare web link models to Wikipedia models on the well-known CoNLL and TAC data sets. Results show that using 34 million web links approaches Wikipedia performance. Combining web link and Wikipedia models produces the best-known disambiguation accuracy of 88.7% on standard newswire test data.
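The name model described above amounts to simple link-count statistics: estimate P(entity | anchor text) from how often each anchor string links to each Wikipedia page. A minimal sketch in Python (the anchor/entity pairs are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical (anchor text, target entity) pairs harvested from web links to Wikipedia.
links = [
    ("jordan", "Michael_Jordan"),
    ("jordan", "Michael_Jordan"),
    ("jordan", "Jordan_(country)"),
    ("michael jordan", "Michael_Jordan"),
]

# Name model: P(entity | anchor) estimated from relative link counts.
name_counts = defaultdict(Counter)
for anchor, entity in links:
    name_counts[anchor][entity] += 1

def name_prob(anchor, entity):
    counts = name_counts[anchor]
    total = sum(counts.values())
    return counts[entity] / total if total else 0.0

print(name_prob("jordan", "Michael_Jordan"))  # 2 of 3 "jordan" links, so about 0.667
```

The entity prior is the analogous count over targets regardless of anchor; the paper's context and coherence models require the surrounding page text and link graph.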

2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web grows significantly. Information on the web comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, where it is semi-structured, but the information required for a given context is scattered across different web documents. It is difficult to analyze these large volumes of semi-structured information and to make decisions based on the analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis and more effective decision making. It enables people and organizations to extract information from various web sources and to analyze the extracted data effectively. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.


Author(s):  
Kristian Krabbenhoft ◽  
J. Wang

A new stress-strain relation capable of reproducing the entire stress-strain range of typical soil tests is presented. The new relation involves a total of five parameters, four of which can be inferred directly from typical test data. The fifth parameter is a fitting parameter with a relatively narrow range. The capabilities of the new relation are demonstrated by application to various clay and sand data sets.


2021 ◽  
Author(s):  
David Cotton

<p><strong>Introduction</strong></p><p>HYDROCOASTAL is a two-year project funded by ESA, with the objective of maximising exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2 and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from the Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating river discharge products.</p><p>New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented, and evaluated through an initial test data set for selected regions. From the results of this evaluation, a processing scheme will be implemented to generate global coastal zone and river discharge data sets.</p><p>A series of case studies will assess these products in terms of their scientific impacts.</p><p>All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.</p><p><strong>Objectives</strong></p><p>The scientific objectives of HYDROCOASTAL are to enhance our understanding of interactions between inland waters and the coastal zone, between the coastal zone and the open ocean, and of the small-scale processes that govern these interactions. The project also aims to improve our capability to characterise the variation at different time scales of inland water storage, exchanges with the ocean, and the impact on regional sea-level changes.</p><p>The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering, and retracking. An improved wet troposphere correction will also be developed and evaluated.</p><p><strong>Project Outline</strong></p><p>There are four tasks in the project:</p><ul><li>Scientific Review and Requirements Consolidation: review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.</li> <li>Implementation and Validation: new processing algorithms will be implemented to generate a test data set, which will be validated against models, in-situ data, and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.</li> <li>Impacts Assessment: the impact of these global products will be assessed in a series of case studies.</li> <li>Outreach and Roadmap: outreach material will be prepared and distributed to engage the wider scientific community and to provide recommendations for the development of future missions and future research.</li> </ul><p><strong>Presentation</strong></p><p>The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms being evaluated in the first phase, and show early results from the evaluation of the initial test data set.</p>


2021 ◽  
Vol 10 (1) ◽  
pp. 105
Author(s):  
I Gusti Ayu Purnami Indryaswari ◽  
Ida Bagus Made Mahendra

Many Indonesian people, especially in Bali, raise pigs as livestock. Pigs are susceptible to various types of diseases, and there have been many cases of pig deaths from disease that cause losses to breeders. The authors therefore built an Android-based application that can predict the type of disease in pigs by applying the C4.5 algorithm. C4.5 is an algorithm for classifying data in order to obtain rules that can be used to make predictions. In this study, 50 training records covering 8 types of pig disease and 31 disease symptoms were entered into the system, which processes the data so that the Android application can predict the type of disease in a pig. Testing on 15 test records produced an accuracy of 86.7%. The application features, built using the Kotlin programming language and the SQLite database, ran as expected.
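The split criterion at the heart of C4.5 is the gain ratio: information gain normalised by split information. A minimal sketch with invented symptom data (one attribute, two disease labels), not the study's actual data:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    # C4.5 split criterion: information gain divided by split information.
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    gain = entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info else 0.0

# Invented toy data: one symptom attribute, two class labels.
rows = [("fever",), ("fever",), ("none",), ("none",)]
labels = ["swine_flu", "swine_flu", "healthy", "healthy"]
print(gain_ratio(rows, labels, 0))  # 1.0: the symptom separates the labels perfectly
```

C4.5 builds the tree by repeatedly splitting on the attribute with the highest gain ratio, which is how a rule set like the one in the study is derived.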


2016 ◽  
Vol 6 (2) ◽  
pp. 1-23 ◽  
Author(s):  
Surbhi Bhatia ◽  
Manisha Sharma ◽  
Komal Kumar Bhatia

Due to the sudden and explosive growth of web technologies, a huge quantity of user-generated content is available online. People's experiences and opinions play an important role in the decision-making process. Although facts are easy to search for on a topic, retrieving opinions is still a crucial task. Opinion mining must therefore be performed efficiently in order to extract constructive opinionated information from these reviews. The present work focuses on the design and implementation of an Opinion Crawler, which downloads opinions from various sites while ignoring the rest of the web. It also detects web pages that are frequently updated and calculates a revisit timestamp for them in order to extract fresh, relevant opinions. The performance of the Opinion Crawler is demonstrated on real data sets, on which it proves accurate in terms of the precision and recall quality attributes.
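The revisit-timestamp idea can be sketched as an adaptive policy: shorten the revisit interval for pages that changed since the last crawl and lengthen it for pages that did not. The doubling/halving rule and the hour bounds below are illustrative assumptions, not the paper's exact formula:

```python
import hashlib

def revisit_interval(prev_hash, new_content, interval_hours, min_h=1.0, max_h=96.0):
    # Halve the revisit interval when the page changed since the last crawl,
    # double it when it did not; clamp to [min_h, max_h] hours.
    new_hash = hashlib.sha256(new_content.encode()).hexdigest()
    interval_hours = interval_hours / 2 if new_hash != prev_hash else interval_hours * 2
    return new_hash, max(min_h, min(max_h, interval_hours))

h = hashlib.sha256("v1".encode()).hexdigest()
h, interval = revisit_interval(h, "v2", 8.0)       # page changed: revisit sooner
print(interval)  # 4.0
h, interval = revisit_interval(h, "v2", interval)  # unchanged: back off
print(interval)  # 8.0
```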


Author(s):  
Chuang Sun ◽  
Zhousuo Zhang ◽  
Zhengjia He ◽  
Zhongjie Shen ◽  
Binqiang Chen ◽  
...  

Bearing performance degradation assessment is meaningful for maintaining mechanical reliability and safety. For this purpose, a novel method based on kernel locality preserving projection is proposed in this article. Kernel locality preserving projection extends traditional locality preserving projection to the non-linear case by using a kernel function, making it better suited to exploring the non-linear information hidden in data sets. It is therefore used here to generate a non-linear subspace from the normal bearing data. The test data are then projected onto this subspace to obtain an index for assessing the degree of bearing degradation. The degradation index, expressed as an inner product, indicates the similarity between the normal data and the test data. Validation on monitoring data from two experiments shows the effectiveness of the proposed method.
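The "similarity to normal data in a kernel-induced space" idea can be illustrated with a mean RBF-kernel similarity. This is a simplified stand-in for the paper's kernel locality preserving projection, which additionally learns a projection that preserves local neighbourhood structure; the feature vectors below are invented:

```python
import math

def rbf(x, y, gamma=0.5):
    # Gaussian (RBF) kernel between two feature vectors.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def degradation_index(normal_data, sample, gamma=0.5):
    # Mean kernel similarity of a test sample to the normal-condition data:
    # near 1 while the bearing behaves normally, dropping towards 0 as it degrades.
    return sum(rbf(x, sample, gamma) for x in normal_data) / len(normal_data)

normal = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]  # invented normal-condition features
print(degradation_index(normal, (0.05, 0.05)))   # close to 1: still normal
print(degradation_index(normal, (3.0, 3.0)))     # close to 0: strongly degraded
```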


2021 ◽  
Vol 22 (Supplement_2) ◽  
Author(s):  
F Ghanbari ◽  
T Joyce ◽  
S Kozerke ◽  
AI Guaricci ◽  
PG Masci ◽  
...  

Abstract Funding Acknowledgements Type of funding sources: Other. Main funding source(s): J. Schwitter receives research support from "Bayer Schweiz AG". C.N.C. received a grant from Siemens. Gianluca Pontone received institutional fees from General Electric, Bracco, Heartflow, Medtronic, and Bayer. U.J.S. received grants from Astellas, Bayer, and General Electric. This work was supported by the Italian Ministry of Health, Rome, Italy (RC 2017 R659/17-CCM698), and by Gyrotools, Zurich, Switzerland. Background: Late gadolinium enhancement (LGE) scar quantification is generally recognized as an accurate and reproducible technique, but it is observer-dependent and time-consuming. Machine learning (ML) offers a potential solution to this problem. Purpose: To develop and validate an ML algorithm for scar quantification that fully avoids observer variability, and to apply this algorithm to the prospective international multicentre Derivate cohort. Method: The Derivate Registry collected heart failure patients with LV ejection fraction <50% in 20 European and US centres. In the post-myocardial infarction patients (n = 689), the quality of the LGE short-axis breath-hold images was graded (good, acceptable, sufficient, borderline, poor, excluded) and ground truth (GT) was produced (endo-epicardial contours, 2 remote reference regions, artefact elimination) to determine the mass of non-infarcted myocardium and of dense (≥5SD above mean-remote) and non-dense scar (>2SD to <5SD above mean-remote). Data were divided into a learning set (total n = 573; training: n = 289; testing: n = 284) and a validation set (n = 116). A Ternaus network (loss function = average of Dice and binary cross-entropy) produced 4 outputs (initial prediction, test-time augmentation (TTA), threshold-based prediction (TB), and TTA + TB) representing normal myocardium, non-dense, and dense scar (Figure 1). Outputs were evaluated by Dice metrics, Bland-Altman analysis, and correlations.
Results  In the validation and test data sets, both not used for training, the dense scar GT was 20.8 ± 9.6% and 21.9 ± 13.3% of LV mass, respectively. The TTA-network yielded the best results with small biases vs GT (-2.2 ± 6.1%, p < 0.02; -1.7 ± 6.0%, p < 0.003, respectively) and 95%CI vs GT in the range of inter-human comparisons, i.e. TTA yielded SD of the differences vs GT in the validation and test data of 6.1 and 6.0 percentage points (%p), respectively (Fig 2), which was comparable to the 7.7%p for the inter-observer comparison (n = 40). For non-dense scar, TTA performance was similar with small biases (-1.9 ± 8.6%, p < 0.0005, -1.4 ± 8.2%, p < 0.0001, in the validation and test sets, respectively, GT 39.2 ± 13.8% and 42.1 ± 14.2%) and acceptable 95%CI with SD of the differences of 8.6 and 8.2%p for TTA vs GT, respectively, and 9.3%p for inter-observer.  Conclusions  In the large Derivate cohort from 20 centres, performance of the presented ML-algorithm to quantify dense and non-dense scar fully automatically is comparable to that of experienced humans with small bias and acceptable 95%-CI. Such a tool could facilitate scar quantification in clinical routine as it eliminates human observer variability and can handle large data sets.
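The scar thresholds quoted in the Method (dense ≥ 5SD above mean-remote, non-dense between 2SD and 5SD) amount to a simple per-pixel rule. A sketch with invented intensity values and remote-region statistics:

```python
def classify_pixel(intensity, remote_mean, remote_sd):
    # Per-pixel rule from the abstract: dense scar at >= 5SD above the mean
    # of the remote (reference) myocardium, non-dense scar between 2SD and
    # 5SD above it, normal myocardium otherwise.
    if intensity >= remote_mean + 5 * remote_sd:
        return "dense"
    if intensity > remote_mean + 2 * remote_sd:
        return "non-dense"
    return "normal"

# Invented values: remote mean 50, remote SD 10.
print(classify_pixel(105, 50, 10))  # dense (>= 100)
print(classify_pixel(75, 50, 10))   # non-dense (between 70 and 100)
print(classify_pixel(60, 50, 10))   # normal
```

The ML pipeline in the study replaces the manual contouring and remote-region selection that this rule depends on; the thresholding itself is unchanged.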


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable for improving early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) was reached for the XGBoost models when the 72 subjects with the smallest RF data Shapley values were excluded from the training data set.
Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoE ϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
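The data-valuation idea, scoring each training subject by how much it helps held-out performance and dropping the lowest-valued ones, can be illustrated with leave-one-out values (a cheap proxy for data Shapley) and a tiny 1-nearest-neighbour classifier. All data below is invented:

```python
def nn_accuracy(train, test):
    # 1-nearest-neighbour accuracy; train/test are lists of (features, label).
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    correct = 0
    for feats, label in test:
        pred = min(train, key=lambda t: sqdist(t[0], feats))[1]
        correct += pred == label
    return correct / len(test)

def loo_values(train, test):
    # Leave-one-out data value: held-out accuracy with the point included
    # minus accuracy without it. Negative values flag points that hurt the
    # model, e.g. mislabelled or noisy subjects.
    full = nn_accuracy(train, test)
    return [full - nn_accuracy(train[:i] + train[i + 1:], test)
            for i in range(len(train))]

# Invented data: the third training point sits in class "a" territory
# but carries label "b", i.e. it is noisy.
train = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((0.1, 0.1), "b")]
test = [((0.08, 0.08), "a"), ((1.0, 1.0), "b")]
print(loo_values(train, test))  # the noisy third point gets a negative value
```

Data Shapley generalises this by averaging each point's marginal contribution over many training subsets rather than only the full set, which makes the valuation less sensitive to interactions between points.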


Author(s):  
Erma Susanti ◽  
Khabib Mustofa

Abstract: Information extraction is a field of natural language processing that converts unstructured text into structured information. Much of the information on the Internet is transmitted in unstructured form via websites, which creates the need for technology to analyze text and find relevant knowledge in structured form. An example of unstructured information is the main content of web pages. Various approaches to information extraction have been developed by many researchers, using either manual or automatic methods, but their accuracy and extraction speed still need improvement. This research proposes an information extraction approach that combines bootstrapping with Ontology-Based Information Extraction (OBIE). The bootstrapping approach, which uses a small seed of labelled data, minimizes human intervention in the extraction process, while an ontology guides the extraction of classes, properties, and instances to provide semantic content for the semantic web. Combining the two approaches is expected to increase the speed of the extraction process and the accuracy of the extraction results. A case study applies the information extraction system to the "LonelyPlanet" dataset. Keywords: information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance
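The bootstrapping loop, starting from a few seed instances, learning the textual contexts they appear in, and then using those contexts to harvest new instances, can be sketched as follows. The seeds, corpus, and four-token context window are all invented for illustration; the paper additionally constrains extraction with an ontology:

```python
seeds = {"Paris"}
corpus = [
    "Paris is the capital of France.",
    "Jakarta is the capital of Indonesia.",
    "The Seine flows through Paris.",
]

# Step 1: learn right-contexts (here, the four tokens after a seed instance).
contexts = set()
for sentence in corpus:
    for seed in seeds:
        if sentence.startswith(seed + " "):
            tail = sentence[len(seed):].split()
            contexts.add(" ".join(tail[:4]))

# Step 2: apply the learned contexts to harvest new instances from the corpus.
extracted = set(seeds)
for sentence in corpus:
    for ctx in contexts:
        if ctx in sentence and not sentence.startswith(ctx):
            extracted.add(sentence.split(ctx)[0].strip())

print(extracted)  # Jakarta is extracted via the learned "is the capital of" context
```

In a full OBIE system, each harvested string would be typed against ontology classes and stored as an instance with its properties, and the enlarged instance set would seed the next bootstrapping iteration.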

