Integrating Collector and Author Roles Across Specimen and Publication Datasets

Biodiversity Information Science and Standards ◽

10.3897/biss.3.35866 ◽

2019 ◽

Vol 3 ◽

Author(s):

Nicky Nicolson ◽

Alan Paton ◽

Sarah Phillips ◽

Allan Tucker

Keyword(s):

Data Mining ◽

Unsupervised Learning ◽

Record Linkage ◽

Large Scale ◽

Single Source ◽

Integration Process ◽

Multiple Sources ◽

Large Scale Dataset ◽

Plant Names ◽

International Plant

This work builds on the outputs of a collector data-mining exercise applied to GBIF-mobilised herbarium specimen metadata, which uses unsupervised learning (clustering) to identify collectors from the minimal metadata associated with field-collected specimens (the DarwinCore terms recordedBy, eventDate and recordNumber). Here, we outline methods to integrate these data-mined collector entities (a large-scale dataset, aggregated from multiple sources, created programmatically) with a dataset of author entities from the International Plant Names Index (a smaller-scale, single-source dataset, created via editorial management). The integration process asserts a generic "scientist" entity with activities in different stages of the species description process: collecting and name publication. We present techniques to investigate specialisations, including content (taxa of study) and activity stages, examining whether individuals focus on collecting and/or name publication. Finally, we discuss generalisations of this initially herbarium-focussed data mining and record linkage process to enable applications in a wider context, particularly in zoological datasets.
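
The abstract does not say how collector and author entities are matched, so the following is only a minimal sketch of one simple linkage strategy: block on surname, compare initials, and accept a link only when exactly one author candidate remains. The names, identifiers, and the `name_key` and `link_collectors_to_authors` helpers are all hypothetical and are not taken from the paper or from IPNI.

```python
import re


def name_key(name):
    """Reduce a 'Surname, Given names' string to (surname, initials).

    Assumes the surname comes first and is followed by a comma.
    """
    surname, _, given = name.partition(",")
    # First letter of each given-name token, e.g. 'J.D.' or 'Jane Doe' -> 'JD'.
    initials = "".join(re.findall(r"\b([A-Za-z])", given)).upper()
    return surname.strip().casefold(), initials


def link_collectors_to_authors(collectors, authors):
    """Return {collector_id: author_id} for unambiguous matches only."""
    # Blocking step: index authors by surname so that only plausible
    # candidates are compared.
    by_surname = {}
    for author_id, name in authors.items():
        surname, initials = name_key(name)
        by_surname.setdefault(surname, []).append((author_id, initials))

    links = {}
    for collector_id, name in collectors.items():
        surname, initials = name_key(name)
        candidates = [
            author_id
            for author_id, author_initials in by_surname.get(surname, [])
            if initials and author_initials
            and (author_initials.startswith(initials)
                 or initials.startswith(author_initials))
        ]
        # Keep the link only if exactly one author is compatible.
        if len(candidates) == 1:
            links[collector_id] = candidates[0]
    return links


# Hypothetical example records (fictional names and identifiers).
collectors = {"c1": "Ashdown, J.D.", "c2": "Birch, A."}
authors = {
    "a1": "Ashdown, Jane Dora",
    "a2": "Ashdown, William James",
    "a3": "Birch, Alan",
    "a4": "Birch, Anna",
}
print(link_collectors_to_authors(collectors, authors))
```

In this toy data the second collector stays unlinked because two authors share the same surname and initial. A real process would need further evidence, such as the taxa of study or activity dates, to resolve such cases.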

Industrial Paper: Large-scale Record Linkage of Web-based Place Entities

10.5753/sbbd.2019.8820 ◽

2019 ◽

Author(s):

Vinícius M. R. Cousseau ◽

Luciano Barbosa

Keyword(s):

Gini Coefficient ◽

Record Linkage ◽

Large Scale ◽

Practical Approach ◽

Industrial Scale ◽

Multiple Sources ◽

Web Based ◽

The Web

Extracting data about entities from the Web has become commonplace in industry and academia alike. Web-based entities, however, are inherently noisy and, as such, introduce several normalization issues that must be addressed in order to maintain a clean database. Record linkage, which refers to the detection of duplicate records from possibly multiple sources, is one of the most critical of those issues. This paper presents a practical approach to solving the record linkage problem in the places data domain at an industrial scale, presenting both a model that reaches a normalized Gini coefficient of 0.92 and an architecture that supports large-scale processing.
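
For readers unfamiliar with the metric, here is a minimal sketch of how a normalized Gini coefficient is commonly computed for a scoring model: the Gini of the model's ranking divided by the Gini of a perfect ranking. The abstract does not describe the authors' exact procedure, so the function names and toy labels below are assumptions for illustration only.

```python
def gini(actual, scores):
    """Gini of `actual` labels when records are ranked by `scores`, highest first."""
    n = len(actual)
    # Stable sort by descending score; ties keep their original order.
    order = sorted(range(n), key=lambda i: -scores[i])
    total = sum(actual)
    cumulative = 0.0
    area = 0.0
    for i in order:
        cumulative += actual[i]
        area += cumulative / total
    # Subtract the area expected from a random ordering, then scale by n.
    return (area - (n + 1) / 2.0) / n


def normalized_gini(actual, scores):
    """Model Gini divided by the Gini of a perfect ranking (at most 1.0)."""
    return gini(actual, scores) / gini(actual, actual)


# Toy example: 1 = true match, 0 = non-match (made-up labels and scores).
labels = [1, 0, 1, 0]
print(normalized_gini(labels, [0.9, 0.1, 0.8, 0.2]))  # perfect ranking -> 1.0
print(normalized_gini(labels, [0.1, 0.9, 0.2, 0.8]))  # inverted ranking -> -1.0
```

For binary labels without tied scores, this quantity equals 2 × AUC − 1, so a value of 0.92 corresponds to an AUC of about 0.96.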

The Use of Medical Record Linkage for Population and Genetic Studies

Methods of Information in Medicine ◽

10.1055/s-0038-1635962 ◽

1969 ◽

Vol 08 (01) ◽

pp. 07-11 ◽

Cited By ~ 9

Author(s):

H. B. Newcombe

Keyword(s):

Record Linkage ◽

Large Scale ◽

Medical Record Linkage ◽

Canadian Province ◽

Genetic Studies ◽

Parental Characteristics ◽

Family Histories ◽

The Family ◽

Large Populations ◽

Machine Readable

Methods are described for deriving personal and family histories of birth, marriage, procreation, ill health and death, for large populations, from existing civil registrations of vital events and the routine records of ill health. Computers have been used to group together and »link« the separately derived records pertaining to successive events in the lives of the same individuals and families, rapidly and on a large scale. Most of the records employed are already available as machine readable punchcards and magnetic tapes, for statistical and administrative purposes, and only minor modifications have been made to the manner in which these are produced. As applied to the population of the Canadian province of British Columbia (currently about 2 million people) these methods have already yielded substantial information on the risks of disease: a) in the population, b) in relation to various parental characteristics, and c) as correlated with previous occurrences in the family histories.

Accelerated Discovery of High-Refractive-Index Polyimides via First-Principles Molecular Modeling, Virtual High-Throughput Screening, and Data Mining

10.26434/chemrxiv.7670903.v1 ◽

2019 ◽

Author(s):

Mohammad Atif Faiz Afzal ◽

Mojtaba Haghighatlari ◽

Sai Prasad Ganesh ◽

Chong Cheng ◽

Johannes Hachmann

Keyword(s):

Data Mining ◽

Refractive Index ◽

High Throughput ◽

First Principles ◽

High Throughput Screening ◽

Large Scale ◽

Computational Study ◽

High Refractive Index ◽

Structural Features ◽

Learning Program

We present a high-throughput computational study to identify novel polyimides (PIs) with exceptional refractive index (RI) values for use as optic or optoelectronic materials. Our study utilizes an RI prediction protocol based on a combination of first-principles and data modeling developed in previous work, which we employ on a large-scale PI candidate library generated with the ChemLG code. We deploy the virtual screening software ChemHTPS to automate the assessment of this extensive pool of PI structures in order to determine the performance potential of each candidate. This rapid and efficient approach yields a number of highly promising lead compounds. Using the data mining and machine learning program package ChemML, we analyze the top candidates with respect to prevalent structural features and feature combinations that distinguish them from less promising ones. In particular, we explore the utility of various strategies that introduce highly polarizable moieties into the PI backbone to increase its RI yield. The derived insights provide a foundation for rational and targeted design that goes beyond traditional trial-and-error searches.

Survey of Clustering Methods for Large Scale Dataset

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i5.13381344 ◽

2019 ◽

Vol 7 (5) ◽

pp. 1338-1344

Author(s):

Anupama Jawale ◽

Ganesh Magar

Keyword(s):

Large Scale ◽

Clustering Methods ◽

Large Scale Dataset

Klasifikasi pada Tempat Tinggal Menurut Provinsi dan Jenis Kepemilikan Berdasarkan Algoritma K-Means [Classification of Residences by Province and Ownership Type Based on the K-Means Algorithm]

STRING (Satuan Tulisan Riset dan Inovasi Teknologi) ◽

10.30998/string.v4i3.5932 ◽

2020 ◽

Vol 4 (3) ◽

pp. 247

Author(s):

Dwi Swasono Rachmad

Keyword(s):

Data Mining ◽

Unsupervised Learning ◽

Residential Buildings ◽

Government Agency ◽

Role Of Government ◽

The Republic ◽

Household Processing ◽

Central Statistics

The word housing derives from house, meaning a place where people live or stay for a certain period of time. Housing is a set of residences grouped into an area that has facilities and infrastructure. This study focuses on the type of residential ownership: SHM ART, SHM Non ART, NON SHM and others. These four types can be used to determine the percentage of ownership in all provinces in Indonesia. Due to the fact that there is still a lot of information about the type of certificate ownership, there is still not much ownership. Therefore, the k-Means algorithm is used as a data mining technique to form clusters, where the data already has parameters or values and the task falls into the category of unsupervised learning. That data produced the best. The data was obtained from published sources of a government agency of the Republic of Indonesia, namely the Central Statistics Agency, in the category of household processing with self-owned residential buildings purchased from developers or non-developers, by province and type of ownership, for 2016 throughout Indonesia. To process the dataset, the researchers used the RapidMiner application for clustering. This research shows that there are more types of ownership in the SHM ART, but for other values it is still smaller than the value in other types of ownership which is the second largest value. Therefore, the role of government in providing assistance in the ownership process, so that residences become SHM ART, is very important.
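
To illustrate the clustering step, here is a minimal pure-Python sketch of k-Means (Lloyd's algorithm). The study itself used RapidMiner, so this is not the authors' implementation, and the two-column ownership percentages below are invented placeholder values, not figures from the Central Statistics Agency.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` into `k` groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input points.
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Placeholder data: (share of one ownership type, share of another) per province.
provinces = [(80, 10), (82, 8), (78, 12), (30, 50), (28, 55), (32, 48)]
labels, centers = kmeans(provinces, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary. What matters is which provinces end up grouped together.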

Joint regression and learning from pairwise rankings for personalized image aesthetic assessment

Computational Visual Media ◽

10.1007/s41095-021-0207-y ◽

2021 ◽

Author(s):

Jin Zhou ◽

Qing Zhang ◽

Jian-Hao Fan ◽

Wei Sun ◽

Wei-Shi Zheng

Keyword(s):

Large Scale ◽

Assessment Model ◽

Generic Model ◽

Small Subset ◽

Deep Convolutional Neural Networks ◽

Personal Taste ◽

Hinge Loss ◽

Novel Approach ◽

Large Scale Dataset ◽

Image Pairs

Recent image aesthetic assessment methods have achieved remarkable progress due to the emergence of deep convolutional neural networks (CNNs). However, these methods focus primarily on predicting the generally perceived preference for an image, which often limits their practicability, since each user may have completely different preferences for the same image. To address this problem, this paper presents a novel approach for predicting personalized image aesthetics that fit an individual user’s personal taste. We achieve this in a coarse-to-fine manner, by joint regression and learning from pairwise rankings. Specifically, we first collect a small subset of personal images from a user and invite them to rank their preference for some randomly sampled image pairs. We then search for the K-nearest neighbors of the personal images within a large-scale dataset labeled with average human aesthetic scores, and use these images as well as the associated scores to train a generic aesthetic assessment model by CNN-based regression. Next, we fine-tune the generic model to accommodate the personal preference by training over the rankings with a pairwise hinge loss. Experiments demonstrate that our method can effectively learn personalized image aesthetic preferences, clearly outperforming state-of-the-art methods. Moreover, we show that the learned personalized image aesthetic benefits a wide variety of applications.
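
The pairwise hinge loss mentioned in the abstract can be sketched as follows. This is the standard margin-based ranking formulation, written in plain Python for clarity. The margin value, the score dictionary, and the function names are assumptions, and the paper's exact loss may differ in detail.

```python
def pairwise_hinge_loss(score_preferred, score_other, margin=1.0):
    """Zero when the preferred image outscores the other by at least `margin`."""
    return max(0.0, margin - (score_preferred - score_other))


def mean_pairwise_hinge_loss(scores, ranked_pairs, margin=1.0):
    """Average loss over (preferred_id, other_id) pairs ranked by the user."""
    losses = [
        pairwise_hinge_loss(scores[preferred], scores[other], margin)
        for preferred, other in ranked_pairs
    ]
    return sum(losses) / len(losses)


# Hypothetical model scores and user rankings (first element is preferred).
scores = {"img_a": 2.5, "img_b": 1.0, "img_c": 1.2}
ranked_pairs = [("img_a", "img_b"), ("img_b", "img_c")]
print(mean_pairwise_hinge_loss(scores, ranked_pairs))
```

The first pair already satisfies the margin and contributes no loss. The second pair is ordered against the user's preference, so it is penalised. During fine-tuning, gradients of this loss would push the preferred image's score above the other's.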

VIPPrint: Validating Synthetic Image Detection and Source Linking Methods on a Large Scale Dataset of Printed Documents

Journal of Imaging ◽

10.3390/jimaging7030050 ◽

2021 ◽

Vol 7 (3) ◽

pp. 50

Author(s):

Anselmo Ferreira ◽

Ehsan Nowroozi ◽

Mauro Barni

Keyword(s):

Large Scale ◽

State Of The Art ◽

Child Pornography ◽

Forensic Analysis ◽

Synthetic Image ◽

Image Detection ◽

Face Images ◽

Large Scale Dataset ◽

Scanned Images ◽

Analysis Of The Images

The possibility of carrying out a meaningful forensic analysis on printed and scanned images plays a major role in many applications. First of all, printed documents are often associated with criminal activities, such as terrorist plans, child pornography, and even fake packages. Additionally, printing and scanning can be used to hide the traces of image manipulation or the synthetic nature of images, since the artifacts commonly found in manipulated and synthetic images are gone after the images are printed and scanned. A problem hindering research in this area is the lack of large-scale reference datasets to be used for algorithm development and benchmarking. Motivated by this issue, we present a new dataset composed of a large number of synthetic and natural printed face images. To highlight the difficulties associated with the analysis of the images of the dataset, we carried out an extensive set of experiments comparing several printer attribution methods. We also verified that state-of-the-art methods to distinguish natural and synthetic face images fail when applied to printed and scanned images. We envision that the availability of the new dataset and the preliminary experiments we carried out will motivate and facilitate further research in this area.

Multi-GPU approach to global induction of classification trees for large-scale data mining

Applied Intelligence ◽

10.1007/s10489-020-01952-5 ◽

2021 ◽

Author(s):

Krzysztof Jurczuk ◽

Marcin Czajkowski ◽

Marek Kretowski

Keyword(s):

Data Mining ◽

Large Scale ◽

Real Life ◽

Population Based ◽

Tree Structure ◽

Global Approach ◽

Data Parallel ◽

Large Scale Data ◽

The Impact ◽

Scale Data

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously and thus improves the prediction and size of the resulting classifiers in many situations. However, it is a population-based and iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.

Accessible Routes Integrating Data from Multiple Sources

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10010007 ◽

2020 ◽

Vol 10 (1) ◽

pp. 7

Author(s):

Miguel R. Luaces ◽

Jesús A. Fisteus ◽

Luis Sánchez-Fernández ◽

Mario Munoz-Organero ◽

Jesús Balado ◽

...

Keyword(s):

Information System ◽

Data Model ◽

Large Scale ◽

Heterogeneous Data ◽

Multiple Sources ◽

Heterogeneous Data Sources ◽

Different Types ◽

Software Sensors ◽

The City

Providing citizens with the ability to move around in an accessible way is a requirement for all cities today. However, modeling city infrastructures so that accessible routes can be computed is a challenge because it involves collecting information from multiple, large-scale and heterogeneous data sources. In this paper, we propose and validate the architecture of an information system that creates an accessibility data model for cities by ingesting data from different types of sources and provides an application that can be used by people with different abilities to compute accessible routes. The article describes the processes that allow building a network of pedestrian infrastructures from the OpenStreetMap information (i.e., sidewalks and pedestrian crossings), improving the network with information extracted from mobile-sensed LiDAR data (i.e., ramps, steps, and pedestrian crossings), detecting obstacles using volunteered information collected from the hardware sensors of the mobile devices of the citizens (i.e., ramps and steps), and detecting accessibility problems with software sensors in social networks (i.e., Twitter). The information system is validated through its application in a case study in the city of Vigo (Spain).

PEDRERA. Positive Energy District Renovation Model for Large Scale Actions

Energies ◽

10.3390/en14102833 ◽

2021 ◽

Vol 14 (10) ◽

pp. 2833

Author(s):

Paolo Civiero ◽

Jordi Pascual ◽

Joaquim Arcas Abella ◽

Ander Bilbao Figuero ◽

Jaume Salom

Keyword(s):

Simulation Model ◽

Performance Indicators ◽

Large Scale ◽

Key Performance Indicators ◽

Positive Energy ◽

Design Phase ◽

Multiple Sources ◽

Reliable Prediction ◽

Sensitive Analysis ◽

Web Platform

In this paper, we provide a view of the ongoing PEDRERA project, whose main scope is to design a district simulation model able to set and analyze a reliable prediction of potential business scenarios on large-scale retrofitting actions, and to evaluate the overall co-benefits resulting from the renovation process of a cluster of buildings. For this purpose, and following a Positive Energy Districts (PEDs) approach, the model combines systemized data, at both building and district scale, from multiple sources and domains. A sensitivity analysis of 200 scenarios provided a quick perception of how results will change once inputs are defined, and how expected results will answer stakeholders’ requirements. In order to enable a clever input analysis and to appraise wide-ranging ranks of Key Performance Indicators (KPIs) suited to each stakeholder and design phase target, the model is currently under implementation in the urbanZEB tool’s web platform.
