Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation

Jonathan Bac; Evgeny M. Mirkes; Alexander N. Gorban; Ivan Tyukin; Andrei Zinovyev

doi:10.3390/e23101368

A Combination of Spatial Pyramid and Inverted Index for Large-Scale Image Retrieval

Computer Vision ◽

10.4018/978-1-5225-5204-8.ch054 ◽

2018 ◽

pp. 1307-1321

Author(s):

Vinh-Tiep Nguyen ◽

Thanh Duc Ngo ◽

Minh-Triet Tran ◽

Duy-Dinh Le ◽

Duc Anh Duong

Keyword(s):

Image Retrieval ◽

Large Scale ◽

Spatial Information ◽

Real Life ◽

Inverted Index ◽

Bag Of Words ◽

Visual Words ◽

Benchmark Datasets ◽

Large Scale Image Retrieval ◽

Inverted Indexing

Large-scale image retrieval has shown remarkable potential in real-life applications. The standard approach is based on Inverted Indexing, with images represented using the Bag-of-Words model. However, one major limitation of both the Inverted Index and the Bag-of-Words representation is that they ignore the spatial information of visual words in image representation and comparison. As a result, retrieval accuracy decreases. In this paper, the authors investigate an approach that integrates spatial information into the Inverted Index to improve accuracy while maintaining short retrieval time. Experiments conducted on several benchmark datasets (Oxford Building 5K, Oxford Building 5K+100K and Paris 6K) demonstrate the effectiveness of the proposed approach.

A Combination of Spatial Pyramid and Inverted Index for Large-Scale Image Retrieval

International Journal of Multimedia Data Engineering and Management ◽

10.4018/ijmdem.2015040103 ◽

2015 ◽

Vol 6 (2) ◽

pp. 37-51 ◽

Cited By ~ 2

Author(s):

Vinh-Tiep Nguyen ◽

Thanh Duc Ngo ◽

Minh-Triet Tran ◽

Duy-Dinh Le ◽

Duc Anh Duong

Keyword(s):

Image Retrieval ◽

Large Scale ◽

Spatial Information ◽

Real Life ◽

Inverted Index ◽

Bag Of Words ◽

Visual Words ◽

Benchmark Datasets ◽

Large Scale Image Retrieval ◽

Inverted Indexing

Large-scale image retrieval has shown remarkable potential in real-life applications. The standard approach is based on Inverted Indexing, with images represented using the Bag-of-Words model. However, one major limitation of both the Inverted Index and the Bag-of-Words representation is that they ignore the spatial information of visual words in image representation and comparison. As a result, retrieval accuracy decreases. In this paper, the authors investigate an approach that integrates spatial information into the Inverted Index to improve accuracy while maintaining short retrieval time. Experiments conducted on several benchmark datasets (Oxford Building 5K, Oxford Building 5K+100K and Paris 6K) demonstrate the effectiveness of the proposed approach.
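
The abstract does not give the indexing scheme itself, so the following is only a minimal sketch of one way spatial information could be folded into an inverted index. It assumes each visual word comes with a normalized position in [0, 1), keys the postings by (word, pyramid level, cell), and scores by weighted histogram intersection. The class name, the per-level weight, and the scoring rule are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def pyramid_cells(x, y, levels=2):
    """Yield (level, cell_index) for a point with coordinates in [0, 1)."""
    for level in range(levels):
        n = 2 ** level  # the image is split into an n x n grid at this level
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        yield level, row * n + col


class SpatialInvertedIndex:
    """Inverted index keyed by (visual word, pyramid level, cell)."""

    def __init__(self, levels=2):
        self.levels = levels
        # (word, level, cell) -> {image_id: occurrence count}
        self.postings = defaultdict(lambda: defaultdict(int))

    def add(self, image_id, features):
        """features: iterable of (visual_word_id, x, y) with x, y in [0, 1)."""
        for word, x, y in features:
            for level, cell in pyramid_cells(x, y, self.levels):
                self.postings[(word, level, cell)][image_id] += 1

    def query(self, features):
        """Rank indexed images by weighted histogram intersection."""
        q_hist = defaultdict(int)
        for word, x, y in features:
            for level, cell in pyramid_cells(x, y, self.levels):
                q_hist[(word, level, cell)] += 1

        scores = defaultdict(float)
        for key, q_count in q_hist.items():
            weight = 2.0 ** key[1]  # assumption: finer levels count more
            for image_id, count in self.postings.get(key, {}).items():
                scores[image_id] += weight * min(q_count, count)
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only posting lists for keys present in the query are visited, so two images that share the same visual words in different places still match at level 0 but lose the finer-level contributions.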

Online Social Networks (OSN) Evolution Model Based on Homophily and Preferential Attachment

Symmetry ◽

10.3390/sym10110654 ◽

2018 ◽

Vol 10 (11) ◽

pp. 654 ◽

Cited By ~ 1

Author(s):

Jebran Khan ◽

Sungchang Lee

Keyword(s):

Social Networks ◽

Structural Properties ◽

Online Social Networks ◽

Large Scale ◽

Preferential Attachment ◽

Real Life ◽

Synthetic Data ◽

Evolution Model ◽

Scale Invariant ◽

Scale Free

In this paper, we propose a new scale-free social network (SN) evolution model that is based on homophily combined with preferential attachment. Our model enables SN researchers to generate synthetic SN data for the evaluation of multi-facet SN models that depend on users’ attributes and similarities. Homophily is one of the key factors for interactive relationship formation in SNs. The synthetic graph generated by our model is scale-invariant and has symmetric relationships. The model is dynamic and robust to changes in input parameters, such as the number of nodes and the nodes’ attributes, conserving its structural properties. Simulation and evaluation of models for large-scale SN applications need large datasets. One way to get SN data is to generate synthetic data by using SN evolution models. Previous research has proposed various SN evolution models to approximate real-life SN graphs. These models are based on SN structural properties such as preferential attachment. The data generated by these models is suitable for evaluating SN models that are structure-dependent, but not for evaluating models that depend on the SN users’ attributes and similarities. In our proposed model, users’ attributes and similarities are utilized to synthesize SN graphs. We evaluated the resultant synthetic graph by analyzing its structural properties. In addition, we validated our model by comparing its measures with publicly available real-life SN datasets and previous SN evolution models. Simulation results show our resultant graph to be a close representation of real-life SN graphs with users’ attributes.
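
To make the combination of homophily and preferential attachment concrete, here is a rough sketch of a growth process in which each new node links to existing nodes with probability proportional to degree times attribute similarity. The similarity function, the 0.1 offset, and all parameter names are assumptions chosen for illustration; the abstract does not specify the authors' actual attachment rule.

```python
import random


def similarity(a, b):
    """Fraction of matching attributes (a simple stand-in for homophily)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def grow_network(n_nodes, n_attrs=3, n_values=2, m=2, seed=0):
    """Grow an undirected graph; requires n_nodes >= m + 1.

    Each new node attaches to m distinct existing nodes, chosen with
    probability proportional to degree * (0.1 + attribute similarity).
    """
    rng = random.Random(seed)
    attrs = [
        tuple(rng.randrange(n_values) for _ in range(n_attrs))
        for _ in range(n_nodes)
    ]
    degree = [0] * n_nodes
    edges = set()

    # Start from a small clique so every early node has non-zero degree.
    for i in range(m + 1):
        for j in range(i):
            edges.add((j, i))
            degree[i] += 1
            degree[j] += 1

    for new in range(m + 1, n_nodes):
        candidates = list(range(new))
        weights = [
            degree[c] * (0.1 + similarity(attrs[new], attrs[c]))
            for c in candidates
        ]
        targets = set()
        while len(targets) < m:
            targets.add(rng.choices(candidates, weights=weights)[0])
        for t in targets:
            edges.add((t, new))  # stored once: relationships are symmetric
            degree[t] += 1
            degree[new] += 1

    return attrs, sorted(edges)
```

Calling `grow_network(50, m=2, seed=1)` returns the node attributes and an edge list; the degree distribution and attribute mixing of such a graph can then be inspected with any graph library.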

RNA splicing analysis using heterogeneous and large RNA-seq datasets

10.1101/2021.11.03.467086 ◽

2021 ◽

Author(s):

Jorge Vaquero-Garcia ◽

Joseph K Aicher ◽

Paul Jewell ◽

Matthew R K Gazzara ◽

Caleb Matthew Radens ◽

...

Keyword(s):

Rna Splicing ◽

Large Scale ◽

Splice Variants ◽

Synthetic Data ◽

Large Datasets ◽

Splicing Regulation ◽

Rna Seq ◽

Differential Splicing ◽

Experimental Conditions ◽

Benchmark Datasets

The ubiquity of RNA-seq has led to many methods that use RNA-seq data to analyze variations in RNA splicing. However, available methods are not well suited for handling heterogeneous and large datasets. Such datasets scale to thousands of samples across dozens of experimental conditions, exhibit increased variability compared to biological replicates, and involve thousands of unannotated splice variants resulting in increased transcriptome complexity. We describe here a suite of algorithms and tools implemented in the MAJIQ v2 package to address challenges in detection, quantification, and visualization of splicing variations from such datasets. Using both large-scale synthetic data and GTEx v8 as benchmark datasets, we demonstrate that the approaches in MAJIQ v2 outperform existing methods. We then apply the MAJIQ v2 package to analyze differential splicing across 2,335 samples from 13 brain subregions, demonstrating its ability to offer new insights into brain subregion-specific splicing regulation.

Neural methods for effective, efficient, and exposure-aware information retrieval

ACM SIGIR Forum ◽

10.1145/3476415.3476434 ◽

2021 ◽

Vol 55 (1) ◽

pp. 1-2

Author(s):

Bhaskar Mitra

Keyword(s):

Information Retrieval ◽

Language Processing ◽

Large Scale ◽

Web Search ◽

Real Life ◽

Inverted Index ◽

Information Need ◽

Product Model ◽

Performance Improvements ◽

Deep Model

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents (or short passages) in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms not seen during training (such as a person's name or a product model number), and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections, such as the document index of a commercial Web search engine, containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018].
Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model, enabling large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
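
The query term independence idea can be illustrated with a small sketch: if the document score is a sum of per-term scores, each term-document score can be computed offline and stored in an inverted index, and query time reduces to summing posting entries. The scorer below is a toy term-frequency function standing in for a learned model, and `build_index` and `search` are hypothetical names, not part of any published implementation.

```python
from collections import defaultdict


def toy_term_score(term, doc_tokens):
    """Hypothetical stand-in for a learned term-document scorer."""
    return doc_tokens.count(term) / len(doc_tokens)


def build_index(docs, vocabulary, score_fn=toy_term_score):
    """Precompute score_fn(term, doc) offline for every vocabulary term."""
    index = defaultdict(dict)  # term -> {doc_id: precomputed score}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for term in vocabulary:
            score = score_fn(term, tokens)
            if score > 0:
                index[term][doc_id] = score
    return index


def search(index, query):
    """Score documents as the sum of precomputed per-term scores."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With a learned scorer, the expensive model runs only in `build_index`; `search` touches just the posting lists of the query terms.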

Multi-GPU approach to global induction of classification trees for large-scale data mining

Applied Intelligence ◽

10.1007/s10489-020-01952-5 ◽

2021 ◽

Author(s):

Krzysztof Jurczuk ◽

Marcin Czajkowski ◽

Marek Kretowski

Keyword(s):

Data Mining ◽

Large Scale ◽

Real Life ◽

Population Based ◽

Tree Structure ◽

Global Approach ◽

Data Parallel ◽

Large Scale Data ◽

The Impact ◽

Scale Data

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously and thus, in many situations, improves the prediction and size of the resulting classifiers. However, it is a population-based and iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.

Robots for Elderly Care in the Home: A Landscape Analysis and Co-Design Toolkit

International Journal of Social Robotics ◽

10.1007/s12369-021-00816-3 ◽

2021 ◽

Author(s):

Gianluca Bardaro ◽

Alessio Antonini ◽

Enrico Motta

Keyword(s):

Large Scale ◽

Daily Life ◽

Elderly Care ◽

Real Life ◽

Robotic Platform ◽

Holistic View ◽

Personal Assistants ◽

Healthcare Interventions ◽

The Impact ◽

The Eu

Over the last two decades, several deployments of robots for in-house assistance of older adults have been trialled. However, these solutions are mostly prototypes and remain unused in real-life scenarios. In this work, we review the historical and current landscape of the field, to try and understand why robots have yet to succeed as personal assistants in daily life. Our analysis focuses on two complementary aspects: the capabilities of the physical platform and the logic of the deployment. The former analysis shows regularities in hardware configurations and functionalities, leading to the definition of a set of six application-level capabilities (exploration, identification, remote control, communication, manipulation, and digital situatedness). The latter focuses on the impact of robots on the daily life of users and categorises the deployment of robots for healthcare interventions using three types of services: support, mitigation, and response. Our investigation reveals that the value of healthcare interventions is limited by a stagnation of functionalities and a disconnection between the robotic platform and the design of the intervention. To address this issue, we propose a novel co-design toolkit, which uses an ecological framework for robot interventions in the healthcare domain. Our approach connects robot capabilities with known geriatric factors, to create a holistic view encompassing both the physical platform and the logic of the deployment. As a case study-based validation, we discuss the use of the toolkit in the pre-design of the robotic platform for a pilot intervention, part of the large-scale pilot of the EU H2020 GATEKEEPER project.

Stacked Community Prediction: A Distributed Stacking-Based Community Extraction Methodology for Large Scale Social Networks

Big Data and Cognitive Computing ◽

10.3390/bdcc5010014 ◽

2021 ◽

Vol 5 (1) ◽

pp. 14

Author(s):

Christos Makris ◽

Georgios Pispirigos

Keyword(s):

Social Networks ◽

Graph Partitioning ◽

Large Scale ◽

Real Life ◽

Information Networks ◽

Digital Marketing ◽

Partitioning Problems ◽

Iterative Solutions ◽

Community Extraction ◽

Stability And Accuracy

Nowadays, due to the extensive use of information networks in a broad range of fields, e.g., bio-informatics, sociology, digital marketing, computer science, etc., graph theory applications have attracted significant scientific interest. Due to its apparent abstraction, community detection has become one of the most thoroughly studied graph partitioning problems. However, the existing algorithms principally propose iterative solutions of high polynomial order that repetitively require exhaustive analysis. These methods can undoubtedly be considered resource-wise overdemanding, unscalable, and inapplicable to big data graphs, such as today’s social networks. In this article, a novel, near-linear, and highly scalable community prediction methodology is introduced. Specifically, using a distributed, stacking-based model, which is built on plain network topology characteristics of bootstrap sampled subgraphs, the underlying community hierarchy of any given social network is efficiently extracted regardless of its size and density. The effectiveness of the proposed methodology has been diligently examined on numerous real-life social networks and proven superior to various similar approaches in terms of performance, stability, and accuracy.

SpecTalk: Conforming IoT Implementations to Sensor Specifications

Sensors ◽

10.3390/s21165260 ◽

2021 ◽

Vol 21 (16) ◽

pp. 5260

Author(s):

Yi-Bing Lin ◽

Sheng-Lin Chou

Keyword(s):

Large Scale ◽

Application Programming Interface ◽

Signal Frequency ◽

Visual Test ◽

Gray Area ◽

Control Signals ◽

Information And Communication ◽

Monitoring Camera ◽

Self Test ◽

Iot Devices

Due to the fast evolution of sensor and Internet of Things (IoT) technologies, several large-scale smart city applications have been commercially developed in recent years. In these developments, contracts are often disputed during acceptance because the contract specification is unclear, resulting in a great deal of discussion of the gray area. Such disputes often occur in the acceptance processes of smart buildings, mainly because most intelligent building systems are expensive and the operations of the sub-systems are very complex. This paper proposes SpecTalk, a platform that automatically generates the code to conform IoT applications to the Taiwan Association of Information and Communication Standards (TAICS) specifications. SpecTalk generates a program to accommodate the application programming interface of the IoT devices under test (DUTs). Then, the devices can be tested by SpecTalk following the TAICS data formats. We describe three types of tests: self-test, mutual-test, and visual test. A self-test involves the sensors and the actuators of the same DUT. A mutual-test involves the sensors and the actuators of different DUTs. A visual test uses a monitoring camera to investigate the actuators of multiple DUTs. We conducted these types of tests in commercially deployed applications of smart campus constructions. Our experiments proved that SpecTalk is feasible and can effectively conform IoT implementations to TAICS specifications. We also propose a simple analytic model to select the frequency of the control signals for the input patterns in a SpecTalk test. Our study indicates that it is appropriate to select the control signal frequency such that the inter-arrival time between two control signals is larger than 10 times the activation delay of the DUT.
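
The closing rule of thumb translates into simple arithmetic: if the inter-arrival time must exceed 10 times the activation delay, the control signal frequency must stay below 1 / (10 × delay). The helper below is only an illustration of that bound; the function name and the example delay are assumptions, and the paper's analytic model itself is not reproduced here.

```python
def max_control_frequency_hz(activation_delay_s, factor=10.0):
    """Upper bound on control signal frequency for a given activation delay.

    The inter-arrival time should be larger than factor * activation_delay_s,
    so the frequency should stay below the returned value.
    """
    if activation_delay_s <= 0:
        raise ValueError("activation delay must be positive")
    return 1.0 / (factor * activation_delay_s)


# Hypothetical DUT with a 50 ms activation delay: control signals should be
# more than 0.5 s apart, i.e. sent at less than about 2 Hz.
limit_hz = max_control_frequency_hz(0.05)
```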

Overcome the Brightness and Jitter Noises in Video Inter-Frame Tampering Detection

Sensors ◽

10.3390/s21123953 ◽

2021 ◽

Vol 21 (12) ◽

pp. 3953

Author(s):

Han Pu ◽

Tianqiang Huang ◽

Bin Weng ◽

Feng Ye ◽

Chenbin Zhao

Keyword(s):

Detection Method ◽

Real Life ◽

Vital Role ◽

Video Forensics ◽

Flow Algorithm ◽

Benchmark Datasets ◽

Media Reports ◽

Intensity Normalization ◽

Inter Frame ◽

Stable Feature

Digital video forensics plays a vital role in judicial forensics, media reports, e-commerce, finance, and public security. Although many methods have been developed, there is currently no efficient solution for real-life videos with illumination noise and jitter noise. To solve this issue, we propose a detection method for video inter-frame forgery that adapts to brightness and jitter. For videos with severe brightness changes, we relax the brightness constancy constraint and adopt intensity normalization to propose a new optical flow algorithm. For videos with large jitter noise, we introduce motion entropy to detect the jitter and extract the stable feature of texture changes fraction for double-checking. Experimental results on public benchmark datasets show that, compared with previous algorithms, the proposed method is more accurate and robust for videos with significant brightness variance or heavy jitter.
