MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data

Computational Intelligence and Neuroscience ◽

10.1155/2015/217216 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13 ◽

Cited By ~ 4

Author(s):

Jingjing Wang ◽

Chen Lin

Keyword(s):

Large Scale ◽

False Negative ◽

Experimental Studies ◽

Simulated Data ◽

False Positives ◽

Locality Sensitive Hashing ◽

False Negatives ◽

Large Scale Data ◽

Similarity Joins ◽

Scale Data

Locality Sensitive Hashing (LSH) has been proposed as an efficient technique for similarity joins on high-dimensional data. The efficiency and approximation rate of LSH depend on the number of false positive and false negative instances it generates. In many domains, reducing the number of false positives is crucial. Furthermore, in some application scenarios, balancing false positives and false negatives is favored. To address these problems, in this paper we propose Personalized Locality Sensitive Hashing (PLSH), in which a new banding scheme is embedded to tailor the number of false positives, false negatives, and the sum of both. PLSH is implemented in parallel using the MapReduce framework to deal with similarity joins on large-scale data. Experimental studies on real and simulated data verify the efficiency and effectiveness of the proposed PLSH technique compared with state-of-the-art methods.
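The trade-off between false positives and false negatives mentioned in this abstract can be illustrated with the standard LSH banding scheme. This is not the PLSH banding scheme proposed in the paper, which the abstract does not specify. The sketch assumes MinHash signatures split into `bands` bands of `rows` rows each, so that a pair with Jaccard similarity `s` becomes a candidate with probability 1 - (1 - s^rows)^bands. The function names are hypothetical.

```python
def candidate_probability(s: float, rows: int, bands: int) -> float:
    """Probability that a pair with similarity s collides in at least one band.

    Assumes the standard banding scheme: a band matches with probability
    s**rows, and the bands are treated as independent.
    """
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(rows: int, bands: int) -> float:
    """Common approximation of the similarity at which the S-curve rises steeply."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    # More rows per band lowers the chance that dissimilar pairs collide
    # (fewer false positives); more bands raises the chance that similar
    # pairs collide (fewer false negatives).
    for rows, bands in [(2, 50), (5, 20), (10, 10)]:
        low = candidate_probability(0.3, rows, bands)
        high = candidate_probability(0.8, rows, bands)
        print(rows, bands, round(low, 4), round(high, 4))
```

Choosing `rows` and `bands` moves the steep part of this curve, which is the general mechanism a banding scheme uses to shift error between false positives and false negatives.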

An Evaluation of Supervised Dimensionality Reduction For Large Scale Data

Journal of Machine and Computing ◽

10.53759/7669/jmc202202003 ◽

2022 ◽

pp. 17-25

Author(s):

Nancy Jan Sliper

Keyword(s):

Dimensionality Reduction ◽

Large Scale ◽

Simulated Data ◽

Principal Component ◽

Low Rank ◽

Learning Tools ◽

Large Scale Data ◽

Reduction Methods ◽

Low Dimensional ◽

Scale Data

Experimenters today frequently measure millions or even billions of characteristics per sample to address critical biological questions, in the hope that machine learning tools will be able to make correct data-driven judgments. Efficient analysis requires a low-dimensional representation that preserves the discriminating features (e.g., whether a certain ailment is present in a person's body) in data whose size and complexity are orders of magnitude apart. While several systems can handle millions of variables and still offer strong empirical and conceptual guarantees, few of them can be clearly understood. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for extending Principal Component Analysis (PCA) by including class moment estimates in low-dimensional projections. Linear Optimum Low-Rank (LOLR) projection, the cheapest variant, includes the class-conditional means. Using both experimental and simulated data benchmarks, we show that LOLR projections and their extensions improve data representations for subsequent classification while retaining computational flexibility and reliability. In terms of accuracy, LOLR outperforms other modular linear dimensionality reduction methods that require much longer computation times on conventional computers. LOLR handles brain image processing datasets with more than 150 million attributes, and many genome sequencing datasets have more than half a million attributes.
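The idea of adding class-conditional means to a PCA-style projection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' LOLR implementation: it assumes the projection basis is built from the differences between class means followed by the leading principal directions of the class-centered data, orthonormalized with a QR decomposition. The function name `mean_augmented_projection` is hypothetical.

```python
import numpy as np


def mean_augmented_projection(X, y, n_components):
    """Return an orthonormal (n_features, n_components) projection matrix.

    Assumed construction (illustrative only): class-mean difference
    directions first, then top right singular vectors of the data after
    each sample is centered by its own class mean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One mean vector per class, in sorted class order.
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Differences of each class mean from the first class mean.
    deltas = means[1:] - means[0]
    # Subtract each sample's own class mean before the PCA step.
    centered = X - means[np.searchsorted(classes, y)]
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Mean directions take priority; PCA directions fill the remainder.
    basis = np.vstack([deltas, vt])[:n_components]
    q, _ = np.linalg.qr(basis.T)
    return q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))
    y = np.repeat([0, 1], 20)
    X[y == 1] += 2.0  # shift the second class so the means differ
    A = mean_augmented_projection(X, y, n_components=3)
    print((X @ A).shape)
```

The projected data `X @ A` can then be passed to any downstream classifier; placing the mean-difference directions first is the design choice that makes the projection supervised.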

PRSice-2: Polygenic Risk Score software for biobank-scale data

GigaScience ◽

10.1093/gigascience/giz082 ◽

2019 ◽

Vol 8 (7) ◽

Cited By ~ 106

Author(s):

Shing Wan Choi ◽

Paul F O'Reilly

Keyword(s):

Risk Score ◽

Large Scale ◽

Experimental Studies ◽

Polygenic Risk Score ◽

Phenotypic Data ◽

Polygenic Risk ◽

Large Scale Data ◽

Or Gene ◽

Scalable Methods ◽

Scale Data

Background: Polygenic risk score (PRS) analyses have become an integral part of biomedical research, exploited to gain insights into shared aetiology among traits, to control for genomic profile in experimental studies, and to strengthen causal inference, among a range of applications. Substantial efforts are now devoted to biobank projects to collect large genetic and phenotypic data, providing unprecedented opportunity for genetic discovery and applications. To process the large-scale data provided by such biobank resources, highly efficient and scalable methods and software are required. Results: Here we introduce PRSice-2, an efficient and scalable software program for automating and simplifying PRS analyses on large-scale data. PRSice-2 handles both genotyped and imputed data, provides empirical association P-values free from inflation due to overfitting, supports different inheritance models, and can evaluate multiple continuous and binary target traits simultaneously. We demonstrate that PRSice-2 is dramatically faster and more memory-efficient than PRSice-1 and alternative PRS software, LDpred and lassosum, while having comparable predictive power. Conclusion: PRSice-2's combination of efficiency and power will be increasingly important as data sizes grow and as the applications of PRS become more sophisticated, e.g., when incorporated into high-dimensional or gene set–based analyses. PRSice-2 is written in C++, with an R script for plotting, and is freely available for download from http://PRSice.info.

FastClone is a probabilistic tool for deconvoluting tumor heterogeneity in bulk-sequencing samples

Nature Communications ◽

10.1038/s41467-020-18169-2 ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Yao Xiao ◽

Xueqing Wang ◽

Hongjiu Zhang ◽

Peter J. Ulintz ◽

Hongyang Li ◽

...

Keyword(s):

Tumor Heterogeneity ◽

Large Scale ◽

Simulated Data ◽

Independent Copy ◽

Stage Iii Colon Cancer ◽

Large Scale Data ◽

Number Variation ◽

The Rich ◽

Chromosome Regions ◽

Scale Data

Dissecting tumor heterogeneity is a key to understanding the complex mechanisms underlying drug resistance in cancers. The rich literature of pioneering studies on tumor heterogeneity analysis spurred a recent community-wide benchmark study that compares diverse modeling algorithms. Here we present FastClone, a top-performing algorithm in accuracy in this benchmark. FastClone improves over existing methods by allowing the deconvolution of subclones that have independent copy number variation events within the same chromosome regions. We characterize the behavior of FastClone in identifying subclones using stage III colon cancer primary tumor samples as well as simulated data. It achieves approximately 100-fold acceleration in computation for both simulated and patient data. The efficacy of FastClone will allow its application to large-scale data and clinical data, and facilitate personalized medicine in cancers.

Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data

Language Processing and Intelligent Information Systems - Lecture Notes in Computer Science ◽

10.1007/978-3-642-38634-3_19 ◽

2013 ◽

pp. 171-178 ◽

Cited By ~ 11

Author(s):

Radosław Szmit

Keyword(s):

Similarity Search ◽

Large Scale ◽

Locality Sensitive Hashing ◽

Large Scale Data ◽

Scale Data

Large-Scale Data Learning Method for Anomaly Detection using Machine Learning for Monitoring Vibration in Vehicle Equipment

IEEJ Transactions on Industry Applications ◽

10.1541/ieejias.140.480 ◽

2020 ◽

Vol 140 (6) ◽

pp. 480-487

Author(s):

Minoru Kondo

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Large Scale ◽

Learning Method ◽

Large Scale Data ◽

Scale Data

Faculty Opinions recommendation of Comparative assessment of large-scale data sets of protein-protein interactions.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1006598.82257 ◽

2002 ◽

Author(s):

Rob Russell

Keyword(s):

Protein Interactions ◽

Large Scale ◽

Comparative Assessment ◽

Data Sets ◽

Protein Protein Interactions ◽

Large Scale Data ◽

Scale Data ◽

Large Scale Data Sets

ProGen: Provenance database generator for large-scale data set

Journal of Computer Applications ◽

10.3724/sp.j.1087.2008.02737 ◽

2009 ◽

Vol 28 (11) ◽

pp. 2737-2740

Author(s):

Xiao ZHANG ◽

Shan WANG ◽

Na LIAN

Keyword(s):

Large Scale ◽

Data Set ◽

Large Scale Data ◽

Scale Data

Construction of integrated particle rendering environment for large scale data visualization

Impact ◽

10.21820/23987073.2018.11.9 ◽

2018 ◽

Vol 2018 (11) ◽

pp. 9-11

Author(s):

Koji Koyamada

Keyword(s):

Data Visualization ◽

Large Scale ◽

Large Scale Data ◽

Scale Data

COMMUNITY-CURATED DATA RESOURCES AND LARGE-SCALE DATA-MODEL SYNTHESES: THE CHILDREN OF COHMAP

10.1130/abs/2016am-286533 ◽

2016 ◽

Author(s):

John W. Williams ◽

Simon Goring ◽

Eric Grimm ◽

Jason McLachlan

Keyword(s):

Data Model ◽

Large Scale ◽

Large Scale Data ◽

Scale Data

Local and global approaches of affinity propagation clustering for large scale data

Journal of Zhejiang University SCIENCE A ◽

10.1631/jzus.a0720058 ◽

2008 ◽

Vol 9 (10) ◽

pp. 1373-1381 ◽

Cited By ~ 28

Author(s):

Ding-yin Xia ◽

Fei Wu ◽

Xu-qing Zhang ◽

Yue-ting Zhuang

Keyword(s):

Large Scale ◽

Affinity Propagation ◽

Large Scale Data ◽

Affinity Propagation Clustering ◽

Scale Data
