Industrial Paper: Large-scale Record Linkage of Web-based Place Entities

2019 ◽  
Author(s):  
Vinícius M. R. Cousseau ◽  
Luciano Barbosa

Extracting data about entities from the Web has become commonplace in industry and academia alike. Web-based entities, however, are inherently noisy and, as such, introduce several normalization issues that must be addressed in order to maintain a clean database. Record linkage, which refers to the detection of duplicate records from possibly multiple sources, is one of the most critical of those issues. This paper presents a practical approach for solving the record linkage problem in the places data domain at an industrial scale, presenting both a model that reaches a normalized Gini coefficient of 0.92 and an architecture that supports large-scale processing.
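The abstract does not say how the normalized Gini coefficient was computed. As a rough illustration only, the sketch below uses one common definition: the Gini of the labels ranked by model score, divided by the Gini of a perfect ranking. The function names, the toy labels and the toy scores are all hypothetical and are not taken from the paper.

```python
def gini(actual, scores):
    """Unnormalized Gini of `actual` labels ranked by descending `scores`.

    Assumes `actual` contains at least one positive label.
    """
    n = len(actual)
    total = sum(actual)
    # Rank records from highest to lowest score; ties keep input order.
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    cumulative, area = 0.0, 0.0
    for i in order:
        cumulative += actual[i]
        area += cumulative / total
    # Subtract the area that a random ordering would give on average.
    return (area - (n + 1) / 2.0) / n


def normalized_gini(actual, scores):
    """Gini of the model ranking divided by the Gini of a perfect ranking."""
    return gini(actual, scores) / gini(actual, actual)


# Hypothetical example: 1 = the candidate pair is a true duplicate.
labels = [1, 0, 1, 0]
model_scores = [0.9, 0.1, 0.8, 0.2]
print(normalized_gini(labels, model_scores))
```

Under this definition a value of 1 means every true duplicate is ranked above every non-duplicate, and a value near 0 means the ranking is no better than random.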

BMC Nursing ◽  
2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Tanja Gustafsson ◽  
Annelie J Sundler ◽  
Elisabeth Lindberg ◽  
Pernilla Karlsson ◽  
Hanna Maurin Söderholm

Background: There is currently a strong emphasis on person-centred care (PCC) and communication; however, little research has been conducted on how to implement person-centred communication in home care settings. Therefore, the ACTION (A person-centred CommunicaTION) programme, a web-based education programme focusing on person-centred communication developed for nurse assistants (NAs) providing home care for older persons, was implemented. This paper reports on the process evaluation conducted to describe and evaluate the implementation of the ACTION programme. Methods: A descriptive design with a mixed-methods approach was used. Twenty-seven NAs from two units in Sweden were recruited, and 23 of them were offered the educational intervention. Quantitative and qualitative data were collected from multiple sources before, during and after the implementation. Quantitative data were used to analyse demographics, attendance and participation, while qualitative data were used to evaluate experiences of the implementation and contextual factors influencing the implementation. Results: The evaluation showed a high degree of NA participation in the first five education modules, and a decrease in the three remaining modules. Overall, the NAs perceived the web format to be easy to use and appreciated the flexibility and accessibility. The content was described as important. Challenges included time constraints; the heavy workload; and a lack of interaction, space and equipment to complete the programme. Conclusions: The results suggest that web-based education is an appropriate strategy in home care settings; however, areas for improvement were identified. Our findings show that participants appreciated the web-based learning format in terms of accessibility and flexibility, as well as the face-to-face group discussions. The critical importance of organizational support and available resources, such as management involvement and local facilitation, is highlighted. In addition, the findings report on the implementation challenges specific to the dynamic home care context. Trial registration: This intervention was implemented with nursing assistants, and the evaluation only involved nursing staff. Patients were not part of this study. According to the ICMJE, registration was not necessary.


2017 ◽  
Vol 6 ◽  
Author(s):  
Saskia Meijboom ◽  
Martinette T. van Houts-Streppel ◽  
Corine Perenboom ◽  
Els Siebelink ◽  
Anne M. van de Wiel ◽  
...  

Self-administered web-based 24-h dietary recalls (24 hR) may save considerable time and money compared with interviewer-administered telephone-based 24 hR interviews and may therefore be useful in large-scale studies. Within the Nutrition Questionnaires plus (NQplus) study, the web-based 24 hR tool Compl-eat™ was developed to assess Dutch participants’ dietary intake. The aim of the present study was to evaluate the performance of this tool against the interviewer-administered telephone-based 24 hR method. A subgroup of participants of the NQplus study (20–70 years, n 514) completed three self-administered web-based 24 hR and three telephone 24 hR interviews administered by a dietitian over a 1-year period. Both Compl-eat™ and the dietitians guided the participants to report all foods consumed the previous day. Compl-eat™ on average underestimated the intake of energy by 8 %, of macronutrients by 10 % and of micronutrients by 13 % as compared with telephone recalls. The agreement between the two methods, estimated using Lin's concordance coefficients (LCC), ranged from 0·15 for vitamin B1 to 0·70 for alcohol intake (mean LCC 0·38). The lower estimations by Compl-eat™ can be explained by a lower number of total reported foods and lower estimated intakes of the food groups fats, oils and savoury sauces, sugar and confectionery, dairy and cheese. The performance of the tool may be improved by, for example, adding an option to automatically select frequently used foods and including more recall cues. We conclude that Compl-eat™ may be a useful tool in large-scale Dutch studies after suggested improvements have been implemented and evaluated.
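For readers unfamiliar with Lin's concordance coefficient, it can be written as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), so it penalizes both weak correlation and systematic offset between two methods. The sketch below is a minimal illustration using population (1/n) moments; the function name and the intake values are hypothetical and are not data from the study.

```python
def lin_ccc(x, y):
    """Lin's concordance correlation coefficient using 1/n moments."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference penalizes a systematic shift between methods.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Made-up intakes: the second method reports exactly one unit more each time.
web_recall = [1.0, 2.0, 3.0]
phone_recall = [2.0, 3.0, 4.0]
print(lin_ccc(web_recall, phone_recall))
```

In this toy case the two series are perfectly correlated, yet the coefficient is well below 1 because of the constant offset, which is why the measure suits method-comparison studies of this kind.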


Author(s):  
Christopher Walton

At the start of this book we outlined the challenges of automatic computer-based processing of information on the Web. These numerous challenges are generally referred to as the ‘vision’ of the Semantic Web. From the outset, we have attempted to take a realistic and pragmatic view of this vision. Our opinion is that the vision may never be fully realized, but that it is a useful goal on which to focus. Each step towards the vision has provided new insights into classical problems in knowledge representation, MASs, and Web-based techniques. Thus, we are presently in a significantly better position as a result of these efforts. It is sometimes difficult to see the purpose of the Semantic Web vision behind all of the different technologies and acronyms. However, the fundamental purpose of the Semantic Web is essentially large-scale, automated data integration. The Semantic Web is not just about providing a more intelligent kind of Web search, but also about taking the results of these searches and combining them in interesting and useful ways. As stated in Chapter 1, the possible applications for the Semantic Web include: automated data mining, e-science experiments, e-learning systems, personalized newspapers and journals, and intelligent devices. The current state of progress towards the Semantic Web vision is summarized in Figure 8.1. This figure shows a pyramid with the human-centric Web at the bottom, sometimes termed the Syntactic Web, and the envisioned Semantic Web at the top. Throughout this book, we have been moving upwards on this pyramid, and it should be clear that a great deal of progress has been made towards the goal. This progress is indicated by the various stages of the pyramid, which can be summarized as follows:
• The lowest stage on the pyramid is the basic Web that should be familiar to everyone. This Web of information is human-centric and contains very little automation. Nonetheless, the Web provides the basic protocols and technologies on which the Semantic Web is founded. Furthermore, the information which is represented on the Web will ultimately be the source of knowledge for the Semantic Web.


2017 ◽  
Vol 25 (1) ◽  
pp. 149-160 ◽  
Author(s):  
Giovanni Benedetto ◽  
Alessia Di Prima ◽  
Salvatore Sciacca ◽  
Giuseppe Grosso

We describe the design of a web-based application (the Software Integrated Cancer Registry, SWInCaRe) used to administer data in a cancer registry and test its validity and usability. A sample of 11,680 records was used to compare the manual and automatic procedures. Sensitivity and specificity, the Health IT Usability Evaluation Scale, and a cost-efficiency analysis were assessed. Several data sources were used to build data packages through text-mining and record linkage algorithms. The automatic procedure showed small yet measurable improvements in both the data linkage process and the estimation of cancer cases. Users perceived the application as useful for reducing the time and difficulty of coding: both the time and cost analyses favored the automatic procedure. The web-based application proved to be a useful tool for the cancer registry, but some improvements are necessary to overcome the limitations observed and to further automate the process.
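The two validity measures named above are usually derived from a confusion matrix that compares the automatic procedure with a reference standard. The sketch below shows that arithmetic only; the function name and the counts are hypothetical and do not come from the registry data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp/fn: true cases the procedure found / missed.
    tn/fp: non-cases the procedure rejected / wrongly flagged.
    """
    sensitivity = tp / (tp + fn)  # share of true cases that were detected
    specificity = tn / (tn + fp)  # share of non-cases correctly rejected
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(sens, spec)
```

Reporting both values matters for a linkage procedure, since a method can catch nearly all true cases while still flagging many records that are not cases.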


2013 ◽  
Vol 21 (1) ◽  
pp. 3-47 ◽  
Author(s):  
IDAN SZPEKTOR ◽  
HRISTO TANEV ◽  
IDO DAGAN ◽  
BONAVENTURA COPPOLA ◽  
MILEN KOUYLEKOV

Entailment recognition is a primary generic task in natural language inference, whose focus is to detect whether the meaning of one expression can be inferred from the meaning of the other. Accordingly, many NLP applications would benefit from high-coverage knowledge bases of paraphrases and entailment rules. To this end, learning such knowledge bases from the Web is especially appealing due to its huge size as well as its highly heterogeneous content, allowing for more scalable rule extraction across various domains. However, the scalability of state-of-the-art approaches to entailment rule acquisition from the Web is still limited. We present a fully unsupervised learning algorithm for Web-based extraction of entailment relations. We focus on increased scalability and generality with respect to prior work, with the potential of a large-scale Web-based knowledge base. Our algorithm takes as its input a lexical–syntactic template and searches the Web for syntactic templates that participate in an entailment relation with the input template. Experiments show promising results, achieving performance similar to that of a state-of-the-art unsupervised algorithm operating over an offline corpus, but with the benefit of learning rules for different domains with no additional effort.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5650 ◽  
Author(s):  
Yannan Fan ◽  
Maria Habib ◽  
Jianguo Xia

Xeno-miRNAs are microRNAs originating from exogenous species detected in host biofluids. A growing number of studies have suggested that many of these xeno-miRNAs may be involved in cross-species interactions and manipulations. To date, hundreds of xeno-miRNAs have been reported in different hosts at various abundance levels. Based on computational predictions, many more miRNAs could potentially be transferred to the human circulatory system. There is a clear need for bioinformatics resources and tools dedicated to xeno-miRNA annotations and their potential functions. To address this need, we have systematically curated xeno-miRNAs from multiple sources, performed target predictions using well-established algorithms, and developed a user-friendly web-based tool, Xeno-miRNet, to allow researchers to search and explore xeno-miRNAs and their potential targets within different host species. Xeno-miRNet currently contains 1,702 (including both detected and predicted) xeno-miRNAs from 54 species and 98,053 potential gene targets in six hosts. The web application is freely available at http://xeno.mirnet.ca.


Author(s):  
Xiuzhen Feng

The word portal has been cited in the literature as one of the most popular terms. A Google search on the Web for the word revealed 25.6 million entries in December 2003. Due to a considerable degree of overuse and overlap, portals are seen everywhere and it would be difficult to make any use of the Web without encountering one (Tatnall, 2004). According to White (2000), a portal provides user-customizable access to information and applications through a Web browser. Tatnall (2004) specifies that a portal aggregates information from multiple sources and makes that information available to various users. In other words, a portal can be defined as an integrated and personalized Web-based application that provides the end user with a single point of access to a wide variety of aggregated content anytime and from anywhere using any Web-enabled client device.


Author(s):  
Nicky Nicolson ◽  
Alan Paton ◽  
Sarah Phillips ◽  
Allan Tucker

This work builds on the outputs of a collector data-mining exercise applied to GBIF-mobilised herbarium specimen metadata, which uses unsupervised learning (clustering) to identify collectors from minimal metadata associated with field-collected specimens (the DarwinCore terms recordedBy, eventDate and recordNumber). Here, we outline methods to integrate these data-mined collector entities (large-scale dataset, aggregated from multiple sources, created programmatically) with a dataset of author entities from the International Plant Names Index (smaller-scale, single-source dataset, created via editorial management). The integration process asserts a generic "scientist" entity with activities in different stages of the species description process: collecting and name publication. We present techniques to investigate specialisations, including content (taxa of study) and activity stages, examining whether individuals focus on collecting and/or name publication. Finally, we discuss generalisations of this initially herbarium-focussed data mining and record linkage process to enable applications in a wider context, particularly in zoological datasets.


Author(s):  
Diana Irina Tanase ◽  
Epaminondas Kapetanios

Combining existing advancements in cross-language information retrieval (CLIR) with the new user-centered Web paradigm could allow tapping into Web-based multilingual clusters of language information that are rich and up to date in terms of language usage, that keep increasing in size, and that have the potential to cater for all languages. In this chapter, we set out to explore existing CLIR systems and their limitations, and we argue that in the current context of a widely adopted social Web, the future of large-scale CLIR and iCLIR systems is linked to the use of the Web as a lexical resource, as a distribution infrastructure, and as a channel of communication between users. Such a synergy will lead to systems that grow organically as more users with different linguistic skills join the network, and that improve in terms of translation disambiguation and coverage.

