Building a large corpus based on newspapers from the web

Author(s):  
Gisle Andersen ◽  
Knut Hofland

2021 ◽ Vol 13 (4) ◽ pp. 1-35
Author(s):  
Gabriel Amaral ◽  
Alessandro Piscopo ◽  
Lucie-aimée Kaffee ◽  
Odinaldo Rodrigues ◽  
Elena Simperl

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important because Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To address this, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on our previous work, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models and scale the analysis up to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices that could encourage the use of higher-quality references more immediately. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.
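As a rough illustration of the crowdsource-then-scale pipeline the abstract describes, the sketch below trains a classifier on consolidated crowd judgments and evaluates it on a held-out split. Everything specific in it (file name, feature columns, label, model choice) is an assumption for illustration, not the authors' actual setup.

```python
# A minimal sketch, assuming the crowd judgments have been consolidated
# into a CSV with simple URL-derived features and a majority-vote label.
# File name, feature columns, and model choice are all hypothetical;
# the paper's actual features and models may differ.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("crowd_judgments.csv")  # hypothetical file of crowd labels
features = ["is_https", "domain_popularity", "url_depth", "label_lang_match"]
X, y = df[features], df["authoritative"]  # 1 = crowd judged authoritative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Held-out precision/recall as a sanity check before scaling the model
# to unlabelled references across the whole of Wikidata.
print(classification_report(y_test, clf.predict(X_test)))
```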


2016 ◽ Vol 2016 ◽ pp. 1-12
Author(s):  
Xi Zhang ◽  
Yuntao Yao ◽  
Yingsheng Ji ◽  
Binxing Fang

Detecting near duplicates on the web is challenging due to the volume and variety of web documents. Most previous studies require input parameters to be set, making it difficult to achieve robustness across various scenarios without careful tuning. Recently, a universal, parameter-free similarity metric, the normalized compression distance (NCD), has been employed effectively in diverse applications. Nevertheless, problems prevent NCD from being applied to medium-to-large datasets: it lacks efficiency and tends to be skewed by large object sizes. To make this parameter-free method feasible on a large corpus of web documents, we propose a new method, SigNCD, which measures NCD on lightweight signatures instead of full documents, improving both efficiency and stability. We derive various lower bounds of NCD and propose pruning policies to further reduce computational complexity. We evaluate SigNCD on both English and Chinese datasets and show an increase in F1 score compared with the original NCD method, as well as a significant reduction in runtime. Comparisons with other competitive methods also demonstrate the superiority of our method. Moreover, SigNCD requires no parameter tuning apart from a similarity threshold.
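For context, NCD is defined as NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the length of a string after compression. The sketch below computes it with zlib and pairs it with a toy shingle-based signature; the signature construction and the absence of pruning bounds are simplifications for illustration, not SigNCD's actual design.

```python
# A minimal sketch of NCD with zlib as the compressor, plus a toy
# "signature" so the distance runs on short sketches rather than full
# documents. SigNCD's real signatures and pruning bounds differ.
import zlib

def clen(data: bytes) -> int:
    """Compressed length of data; stands in for C(.) in the NCD formula."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)

def signature(text: str, k: int = 3, size: int = 64) -> bytes:
    """Toy signature: a fixed-size, hash-ordered sample of word k-shingles."""
    words = text.split()
    shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    return " ".join(sorted(shingles, key=hash)[:size]).encode()

a = "the quick brown fox jumps over the lazy dog " * 20
b = "the quick brown fox leaps over the lazy dog " * 20
print(ncd(signature(a), signature(b)))  # small value => near duplicates
```

Running NCD on short signatures rather than full pages is what makes the metric tractable at corpus scale: compression cost grows with input length, so shrinking the inputs shrinks every pairwise comparison.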


2012 ◽ Vol 3 (1) ◽ pp. 64-77
Author(s):  
Peng Jiang ◽  
Qing Yang ◽  
Chunxia Zhang ◽  
Zhendong Niu ◽  
Hongping Fu

As the web becomes the largest knowledge repository, containing various entities and their relations, the task of related entity retrieval has attracted interest in the field of information retrieval. This challenging task was introduced in the TREC 2009 Entity Track: given an entity and the type of the target entity, a retrieval system must return a ranked list of related entities extracted from a given large corpus. Entity ranking therefore goes beyond entity relevance and integrates a judgment of the relation into the evaluation of retrieved entities. This paper proposes a probabilistic model that uses relation patterns to address the task of related entity retrieval, taking into account both the relevance of and the relation between entities. The authors focus on using relation patterns to measure how well the relations between entities match, and from that to estimate the probability that a relation holds between two entities. In addition, the authors represent each entity by its context language model and measure the relevance between two entities with a language model. Experimental results on the TREC Entity Track dataset show that the proposed model significantly improves retrieval performance over the baseline. Comparison with other approaches also demonstrates the effectiveness of the model.
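As a toy illustration of combining the two signals, the sketch below scores a candidate entity by a Dirichlet-smoothed language model over its context document plus a crude pattern-match estimate of the relation, mixed log-linearly. The smoothing, the pattern list, and the mixing weight are assumptions for illustration, not the authors' exact formulation.

```python
# A toy, log-linear combination of entity relevance (context language
# model) and relation evidence (pattern matches), in the spirit of the
# model described above; parameters and patterns are illustrative.
import math
from collections import Counter

def context_lm_logprob(query_terms, context_terms, mu=2000):
    """Dirichlet-smoothed unigram LM score of the query given the
    entity's context document (the context doubles as background here)."""
    tf = Counter(context_terms)
    dl = sum(tf.values())
    vocab = len(tf) or 1
    return sum(
        math.log((tf[t] + mu * (tf[t] + 1) / (dl + vocab)) / (dl + mu))
        for t in query_terms
    )

def relation_match_prob(context_terms, patterns):
    """Smoothed fraction of relation patterns found in the context:
    a crude stand-in for pattern-based relation probability."""
    text = " ".join(context_terms)
    hits = sum(1 for p in patterns if p in text)
    return (hits + 0.5) / (len(patterns) + 1)

def score(query_terms, context_terms, patterns, alpha=0.7):
    """Log-linear mix of relevance and relation evidence."""
    return alpha * context_lm_logprob(query_terms, context_terms) + \
        (1 - alpha) * math.log(relation_match_prob(context_terms, patterns))

ctx = "larry page is a founder of google and works at google".split()
print(score(["google", "founder"], ctx, ["founder of", "works at"]))
```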


Author(s):  
Francesca Frontini ◽  
Carmen Brando ◽  
Marine Riguet ◽  
Clémence Jacquot ◽  
Vincent Jolivet

This paper discusses the challenges and benefits of annotating place names in literary texts and literary criticism. We first highlight the problems of encoding spatial information in digital editions in the TEI format, by means of two manual annotation experiments and a discussion of various cases. This leads to the question of how existing semantic web resources can complement and enrich toponym mark-up, in particular by providing mentions with precise geo-referencing. Finally, the automatic annotation of a large corpus shows the potential of visualizing places from texts, illustrated by an analysis of the evolution of literary life from a spatial and geographical point of view.
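To make the mark-up question concrete, the sketch below builds a minimal TEI fragment in which a place-name mention carries a @ref attribute pointing to a gazetteer entry on the semantic web (here a GeoNames URI). The sentence and identifier are illustrative examples, not taken from the paper's corpus.

```python
# A minimal sketch of TEI place-name mark-up with geo-referencing via
# @ref; the sentence and GeoNames URI are illustrative examples.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

s = ET.Element(f"{{{TEI_NS}}}s")
s.text = "She arrived in "
place = ET.SubElement(
    s, f"{{{TEI_NS}}}placeName",
    {"ref": "http://www.geonames.org/2988507"},  # GeoNames entry for Paris
)
place.text = "Paris"
place.tail = " in the spring."

print(ET.tostring(s, encoding="unicode"))
# -> <s xmlns="http://www.tei-c.org/ns/1.0">She arrived in
#    <placeName ref="http://www.geonames.org/2988507">Paris</placeName>
#    in the spring.</s>
```

Linking the mention to a stable URI rather than embedding raw coordinates keeps the edition's mark-up light while still allowing precise geo-referencing and map visualization to be resolved from the gazetteer.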


2008 ◽ Vol 11 (2) ◽ pp. 83-85
Author(s):  
Howard Wilson

2005 ◽ Vol 8 (1) ◽ pp. 16-18
Author(s):  
Howard F. Wilson

1999 ◽ Vol 3 (2) ◽ pp. 6-6
Author(s):  
Barbara Shadden
