Building a large corpus based on newspapers from the web

Author(s):  
Gisle Andersen ◽  
Knut Hofland

2021 ◽ Vol 13 (4) ◽ pp. 1-35
Author(s):  
Gabriel Amaral ◽  
Alessandro Piscopo ◽  
Lucie-aimée Kaffee ◽  
Odinaldo Rodrigues ◽  
Elena Simperl

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important because Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To address this, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on our previous work, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models and scale the analysis up to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices that could encourage the use of higher-quality references more immediately. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.
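As a rough illustration of the crowdsource-then-scale pipeline the abstract describes, the sketch below trains a classifier on consolidated crowd judgments and evaluates it on a held-out split. Everything specific in it (file name, feature columns, label, model choice) is an assumption for illustration, not the authors' actual setup.

```python
# A minimal sketch, assuming the crowd judgments have been consolidated
# into a CSV with simple URL-derived features and a majority-vote label.
# File name, feature columns, and model choice are all hypothetical;
# the paper's actual features and models may differ.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("crowd_judgments.csv")  # hypothetical file of crowd labels
features = ["is_https", "domain_popularity", "url_depth", "label_lang_match"]
X, y = df[features], df["authoritative"]  # 1 = crowd judged authoritative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Held-out precision/recall as a sanity check before scaling the model
# to unlabelled references across the whole of Wikidata.
print(classification_report(y_test, clf.predict(X_test)))
```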


2016 ◽ Vol 2016 ◽ pp. 1-12
Author(s):  
Xi Zhang ◽  
Yuntao Yao ◽  
Yingsheng Ji ◽  
Binxing Fang

Detecting near duplicates on the web is challenging due to the volume and variety of web documents. Most previous studies require input parameters to be set, making it difficult to achieve robustness across various scenarios without careful tuning. Recently, a universal, parameter-free similarity metric, the normalized compression distance (NCD), has been employed effectively in diverse applications. Nevertheless, problems prevent NCD from being applied to medium-to-large datasets: it lacks efficiency and tends to be skewed by large object sizes. To make this parameter-free method feasible on a large corpus of web documents, we propose a new method, SigNCD, which measures NCD on lightweight signatures instead of full documents, improving both efficiency and stability. We derive various lower bounds of NCD and propose pruning policies to further reduce computational complexity. We evaluate SigNCD on both English and Chinese datasets and show an increase in F1 score compared with the original NCD method, as well as a significant reduction in runtime. Comparisons with other competitive methods also demonstrate the superiority of our method. Moreover, SigNCD requires no parameter tuning apart from a similarity threshold.
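For context, NCD is defined as NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the length of a string after compression. The sketch below computes it with zlib and pairs it with a toy shingle-based signature; the signature construction and the absence of pruning bounds are simplifications for illustration, not SigNCD's actual design.

```python
# A minimal sketch of NCD with zlib as the compressor, plus a toy
# "signature" so the distance runs on short sketches rather than full
# documents. SigNCD's real signatures and pruning bounds differ.
import zlib

def clen(data: bytes) -> int:
    """Compressed length of data; stands in for C(.) in the NCD formula."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)

def signature(text: str, k: int = 3, size: int = 64) -> bytes:
    """Toy signature: a fixed-size, hash-ordered sample of word k-shingles."""
    words = text.split()
    shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    return " ".join(sorted(shingles, key=hash)[:size]).encode()

a = "the quick brown fox jumps over the lazy dog " * 20
b = "the quick brown fox leaps over the lazy dog " * 20
print(ncd(signature(a), signature(b)))  # small value => near duplicates
```

Running NCD on short signatures rather than full pages is what makes the metric tractable at corpus scale: compression cost grows with input length, so shrinking the inputs shrinks every pairwise comparison.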


2012 ◽ Vol 3 (1) ◽ pp. 64-77
Author(s):  
Peng Jiang ◽  
Qing Yang ◽  
Chunxia Zhang ◽  
Zhendong Niu ◽  
Hongping Fu

As the web becomes the largest knowledge repository, containing various entities and their relations, the task of related entity retrieval has attracted interest in the field of information retrieval. This challenging task was introduced in the TREC 2009 Entity Track: given an entity and the type of the target entity, a retrieval system must return a ranked list of related entities extracted from a given large corpus. Entity ranking therefore goes beyond entity relevance and integrates a judgment of the relation into the evaluation of retrieved entities. This paper proposes a probabilistic model that uses relation patterns to address the task of related entity retrieval, taking into account both the relevance of and the relation between entities. The authors focus on using relation patterns to measure how well the relations between entities match, and from that to estimate the probability that a relation holds between two entities. In addition, the authors represent each entity by its context language model and measure the relevance between two entities with a language model. Experimental results on the TREC Entity Track dataset show that the proposed model significantly improves retrieval performance over the baseline. Comparison with other approaches also demonstrates the effectiveness of the model.
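As a toy illustration of combining the two signals, the sketch below scores a candidate entity by a Dirichlet-smoothed language model over its context document plus a crude pattern-match estimate of the relation, mixed log-linearly. The smoothing, the pattern list, and the mixing weight are assumptions for illustration, not the authors' exact formulation.

```python
# A toy, log-linear combination of entity relevance (context language
# model) and relation evidence (pattern matches), in the spirit of the
# model described above; parameters and patterns are illustrative.
import math
from collections import Counter

def context_lm_logprob(query_terms, context_terms, mu=2000):
    """Dirichlet-smoothed unigram LM score of the query given the
    entity's context document (the context doubles as background here)."""
    tf = Counter(context_terms)
    dl = sum(tf.values())
    vocab = len(tf) or 1
    return sum(
        math.log((tf[t] + mu * (tf[t] + 1) / (dl + vocab)) / (dl + mu))
        for t in query_terms
    )

def relation_match_prob(context_terms, patterns):
    """Smoothed fraction of relation patterns found in the context:
    a crude stand-in for pattern-based relation probability."""
    text = " ".join(context_terms)
    hits = sum(1 for p in patterns if p in text)
    return (hits + 0.5) / (len(patterns) + 1)

def score(query_terms, context_terms, patterns, alpha=0.7):
    """Log-linear mix of relevance and relation evidence."""
    return alpha * context_lm_logprob(query_terms, context_terms) + \
        (1 - alpha) * math.log(relation_match_prob(context_terms, patterns))

ctx = "larry page is a founder of google and works at google".split()
print(score(["google", "founder"], ctx, ["founder of", "works at"]))
```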


Author(s):  
Francesca Frontini ◽  
Carmen Brando ◽  
Marine Riguet ◽  
Clémence Jacquot ◽  
Vincent Jolivet

This paper discusses the challenges and benefits of annotating place names in literary texts and literary criticism. We first highlight the problems of encoding spatial information in digital editions in the TEI format, by means of two manual annotation experiments and a discussion of various cases. This leads to the question of how existing semantic web resources can complement and enrich toponym mark-up, in particular by providing mentions with precise geo-referencing. Finally, the automatic annotation of a large corpus shows the potential of visualizing places from texts, illustrated by an analysis of the evolution of literary life from a spatial and geographical point of view.
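To make the mark-up question concrete, the sketch below builds a minimal TEI fragment in which a place-name mention carries a @ref attribute pointing to a gazetteer entry on the semantic web (here a GeoNames URI). The sentence and identifier are illustrative examples, not taken from the paper's corpus.

```python
# A minimal sketch of TEI place-name mark-up with geo-referencing via
# @ref; the sentence and GeoNames URI are illustrative examples.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

s = ET.Element(f"{{{TEI_NS}}}s")
s.text = "She arrived in "
place = ET.SubElement(
    s, f"{{{TEI_NS}}}placeName",
    {"ref": "http://www.geonames.org/2988507"},  # GeoNames entry for Paris
)
place.text = "Paris"
place.tail = " in the spring."

print(ET.tostring(s, encoding="unicode"))
# -> <s xmlns="http://www.tei-c.org/ns/1.0">She arrived in
#    <placeName ref="http://www.geonames.org/2988507">Paris</placeName>
#    in the spring.</s>
```

Linking the mention to a stable URI rather than embedding raw coordinates keeps the edition's mark-up light while still allowing precise geo-referencing and map visualization to be resolved from the gazetteer.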


2008 ◽ Vol 11 (2) ◽ pp. 83-85
Author(s):  
Howard Wilson

2005 ◽ Vol 8 (1) ◽ pp. 16-18
Author(s):  
Howard F. Wilson

1999 ◽ Vol 3 (2) ◽ pp. 6-6
Author(s):  
Barbara Shadden
