Comparison of marker selection methods for high throughput scRNA-seq data

2019 ◽  
Author(s):  
Anna C. Gilbert ◽  
Alexander Vargo

Abstract Here, we evaluate the performance of a variety of marker selection methods on scRNA-seq UMI counts data. We test on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. In addition, we propose several performance measures for evaluating the quality of a set of markers when there is no known ground truth. According to these metrics, most existing marker selection methods show similar performance on experimental scRNA-seq data; thus, the speed of the algorithm is the most important consideration for large data sets. With this in mind, we introduce RANKCORR, a fast marker selection method with strong mathematical underpinnings that takes a step towards sensible multi-class marker selection.

2007 ◽  
Vol 17 (01) ◽  
pp. 71-103 ◽  
Author(s):  
NARGESS MEMARSADEGHI ◽  
DAVID M. MOUNT ◽  
NATHAN S. NETANYAHU ◽  
JACQUELINE LE MOIGNE

Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.
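The nearest-center query that dominates ISODATA's running time is exactly the kind of query a kd-tree accelerates. A minimal sketch using SciPy; note that the paper's filtering algorithm stores the points in a kd-tree and prunes candidate centers per subtree, whereas building the tree over the centers, as below, is a simpler illustration of the same nearest-center query:

```python
import numpy as np
from scipy.spatial import cKDTree

def assign_clusters(points, centers):
    """Assign each point to its nearest cluster center via a kd-tree.
    (Simplified: a tree over the centers, queried once per point.)"""
    tree = cKDTree(centers)
    _, idx = tree.query(points)   # index of the nearest center for each point
    return idx

# toy data: 1000 points around two well-separated centers
rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 2))
ctrs = np.array([[-2.0, 0.0], [2.0, 0.0]])
labels = assign_clusters(pts, ctrs)
print(labels[:10])
```

For large point sets this replaces a linear scan over all centers with a logarithmic-depth tree search, which is the source of the running-time gains the abstract describes.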


2020 ◽  
pp. 81-93
Author(s):  
D. V. Shalyapin ◽  
D. L. Bakirov ◽  
M. M. Fattakhov ◽  
A. D. Shalyapina ◽  
A. V. Melekhov ◽  
...  

The article is devoted to the quality of well casing at the Pyakyakhinskoye oil and gas condensate field. Improving the quality of well casing involves many problems, for example, the large amount of work required to relate laboratory studies to actual field data, and the difficulty of finding logically determined relationships between individual parameters and the final quality of the casing. The article presents a new approach to assessing the impact of various parameters, based on a mathematical apparatus that excludes subjective expert judgment; in the future, this will allow the method to be applied to fields with different rock and geological conditions. We propose processing large data sets with neural networks trained to predict the characteristics of well casing quality (continuity of the cement's contact with the rock and with the casing). Taking into account the previously identified factors, we developed solutions to improve the tightness of the well casing and the adhesion of cement to the bounding surfaces.
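The proposed neural-network prediction can be illustrated with a toy regressor. Everything below is hypothetical: the feature names, the synthetic relationship, and scikit-learn's MLPRegressor stand in for the (unspecified) network trained in the study:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical, normalized cementing parameters per well interval:
# [cement slurry density, displacement rate, casing standoff], all in [0, 1].
# Target: fraction of the interval with continuous cement-to-casing contact.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 0.4 + 0.3 * X[:, 2] + 0.05 * rng.normal(size=200)  # invented relationship

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,
                     random_state=0).fit(X, y)
pred = model.predict(X)   # predicted contact-continuity fraction per interval
```

The point of such a model, as the abstract argues, is to replace subjective expert scoring with a mapping learned from large volumes of field data.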


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Alexander H. S. Vargo ◽  
Anna C. Gilbert

Abstract Background High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic markers that can identify specific cell populations is thus one of the major objectives of computational analysis of mRNA counts data. Many tools have been developed for marker selection on single cell data; most of them, however, are based on complex statistical models and handle the multi-class case in an ad-hoc manner. Results We introduce RankCorr, a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner. RankCorr proceeds by ranking the mRNA counts data before linearly separating the ranked data using a small number of genes. The step of ranking is intuitively natural for scRNA-seq data and provides a non-parametric method for analyzing count data. In addition, we present several performance measures for evaluating the quality of a set of markers when there is no known ground truth. Using these metrics, we compare the performance of RankCorr to a variety of other marker selection methods on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. Conclusions According to the metrics introduced in this work, RankCorr is consistently one of the most effective marker selection methods on scRNA-seq data. Most methods show similar overall performance, however; thus, the speed of the algorithm is the most important consideration for large data sets (and comparing the markers selected by several methods can be fruitful). RankCorr is fast enough to easily handle the largest data sets and, as such, it is a useful tool to add into computational pipelines when dealing with high throughput scRNA-seq data.
RankCorr software is available for download at https://github.com/ahsv/RankCorr with extensive documentation.
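The rank-then-separate idea can be sketched in a few lines. The following is an illustrative stand-in, not the published algorithm: it rank-transforms the counts within each cell and then uses L1-regularized logistic regression (in place of RankCorr's own sparse separation procedure) to keep a small number of genes; the toy data and the helper name are hypothetical:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LogisticRegression

def rank_markers(X, labels, n_markers=5):
    """Rank genes within each cell, fit a sparse linear separator on the
    ranks, and keep the genes with the largest coefficients.
    (Illustrative stand-in for RankCorr's rank-then-separate approach.)"""
    R = np.apply_along_axis(rankdata, 1, X)            # per-cell gene ranks
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(R, labels)                                 # one-vs-rest if multi-class
    scores = np.abs(clf.coef_).max(axis=0)             # per-gene importance
    return np.argsort(scores)[::-1][:n_markers]

# toy counts: 60 cells x 20 genes; gene 0 is a strong marker for class 1
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(60, 20)).astype(float)
y = np.array([0] * 30 + [1] * 30)
X[30:, 0] += 10
print(rank_markers(X, y))
```

Ranking makes the selection non-parametric, as the abstract notes: only the ordering of counts within a cell matters, not their raw magnitudes.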


2019 ◽  
Vol 12 (1) ◽  
pp. 34-40
Author(s):  
Mareeswari Venkatachalaappaswamy ◽  
Vijayan Ramaraj ◽  
Saranya Ravichandran

Background: Many modern applications use information filtering to expose users to a collection of data. Such systems provide users with a list of recommended items they might prefer, or predict the ratings they might give to those items, so that users can select the items they prefer from that list. Objective: In web service recommendation based on Quality of Service (QoS), predicting QoS values greatly helps people select appropriate web services and discover new ones. Methods: An effective technique for this is Collaborative Filtering (CF). CF greatly helps in service selection and web service recommendation. It is a general way of filtering information from large data sets; in a narrower sense, it is a method of predicting a user's interests by collecting taste information from many users. Results: It is easy to build and is also much more effective for recommendations, predicting missing QoS values for users. It also addresses the scalability problem, since recommendations are based on like-minded users found with PCC, or on clusters found with KNN, rather than on the entire large data source. Conclusion: In this paper, location-aware collaborative filtering is used to recommend services. The proposed system compares prediction outcomes and execution time with existing algorithms.
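The PCC-based prediction step can be sketched compactly. This is a minimal user-based CF illustration with made-up response-time data; the paper's system additionally uses location awareness and KNN clustering, neither of which appears here:

```python
import numpy as np

def pearson_sim(u, v, mask):
    """Pearson Correlation Coefficient (PCC) over co-observed entries."""
    if mask.sum() < 2:
        return 0.0
    a = u[mask] - u[mask].mean()
    b = v[mask] - v[mask].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def predict_qos(R, user, item):
    """Predict the missing QoS value R[user, item] as a similarity-weighted
    average over like-minded users (simplified user-based CF)."""
    observed = ~np.isnan(R)
    num = den = 0.0
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue
        s = pearson_sim(R[user], R[other], observed[user] & observed[other])
        if s > 0:                      # keep only positively correlated users
            num += s * R[other, item]
            den += s
    return num / den if den > 0 else np.nan

# toy response-time matrix (users x services); NaN marks unobserved values
R = np.array([[0.2, 0.5, np.nan, 0.9],
              [0.3, 0.6, 0.4,    1.0],
              [0.9, 0.1, 0.8,    0.2]])
print(predict_qos(R, 0, 2))
```

Here user 1 has QoS observations perfectly correlated with user 0's, so the prediction for the missing service is drawn almost entirely from user 1.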


Author(s):  
Ana Gainaru ◽  
Hongyang Sun ◽  
Guillaume Aupy ◽  
Yuankai Huo ◽  
Bennett A Landman ◽  
...  

Scientific insights in the coming decade will clearly depend on the effective processing of large data sets generated by dynamic heterogeneous applications typical of workflows in large data centers or of emerging fields like neuroscience. In this article, we show how these big data workflows have a unique set of characteristics that pose challenges for leveraging HPC methodologies, particularly in scheduling. Our findings indicate that execution times for these workflows are highly unpredictable and are not correlated with the size of the data set involved or the precise functions used in the analysis. We characterize this inherent variability and sketch the need for new scheduling approaches by quantifying significant gaps in achievable performance. Through simulations, we show how on-the-fly scheduling approaches can deliver benefits in both system-level and user-level performance measures. On average, we find improvements of up to 35% in system utilization and up to 45% in average stretch of the applications, illustrating the potential of increasing performance through new scheduling approaches.
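The stretch metric reported above has a simple definition: the time a job spends in the system divided by its own processing time. A toy illustration (a hypothetical helper, not taken from the paper's simulator):

```python
def stretch(submit, start, runtime):
    """Stretch of a job: time in system / processing time.
    A stretch of 1.0 means the job never waited; larger values mean
    the job was slowed down by contention in the system."""
    completion = start + runtime
    return (completion - submit) / runtime

# a job submitted at t=0 that waits 5s before its 10s run
print(stretch(0, 5, 10))   # → 1.5
```

Minimizing average stretch, rather than raw response time, prevents short jobs from being disproportionately delayed behind long ones, which is why it is a common user-level metric in this setting.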


1990 ◽  
Vol 6 (2) ◽  
pp. 220-228 ◽  
Author(s):  
Robert W. Dubois

Abstract Modeling death rates has been suggested as a potential method to screen hospitals and identify superior and substandard providers. This article begins with a review of one hospital death rate study and focuses upon its findings and limitations. It also explores the inherent limitations in the use of large data sets to assess quality of care.


1990 ◽  
Vol 6 (2) ◽  
pp. 229-238 ◽  
Author(s):  
Susan Desharnais

Abstract This article examines how large data sets can be used for evaluating the effects of health policy changes and for flagging providers with potential quality problems. An example is presented, illustrating how three risk-adjusted measures of hospital performance were developed using patient discharge abstracts. Advantages and disadvantages of this approach are discussed.


2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Ion Matei ◽  
Maksym Zhenirovskyy ◽  
Johan De Kleer ◽  
Alexander Feldman

Machine learning based diagnosis engines require large data sets for training. When experimental data is insufficient, system models can be used to supplement the data. Such models are typically simplified and imprecise, and hence carry some degree of uncertainty. In this paper we show how to deal with uncertainty in synthetic training data. The data is produced using a model with uncertainties. The uncertainties originate from inaccurate parameter values or from parameters that take different values depending on the mode of operation. We demonstrate how techniques from the uncertainty quantification field can be used to reduce the numerical complexity of the training algorithm. In particular, we use generalized polynomial chaos to efficiently approximate the loss function. In addition, we present a neural network architecture specifically designed to deal with uncertainties in the training data. As an illustrative example, we show how our approach can be used to detect faults in an elevator system.
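For Gaussian uncertainties, generalized polynomial chaos reduces an expectation over the uncertain parameter to a small Gauss-Hermite quadrature sum, which is the source of the numerical savings. A minimal, illustrative sketch for a single scalar uncertain parameter (the paper applies the idea to a full training loss):

```python
import numpy as np

def expected_loss(loss, mean, std, order=8):
    """Approximate E[loss(theta)] for theta ~ N(mean, std^2) using
    Gauss-Hermite quadrature, the rule underlying generalized polynomial
    chaos for Gaussian inputs. 'loss' must accept numpy arrays."""
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    theta = mean + std * np.sqrt(2.0) * nodes      # change of variables
    return float((weights * loss(theta)).sum() / np.sqrt(np.pi))

# sanity check: E[(theta - 1)^2] with theta ~ N(1, 0.5^2) is the variance, 0.25
print(expected_loss(lambda t: (t - 1.0) ** 2, mean=1.0, std=0.5))
```

An order-8 rule like this evaluates the loss only eight times yet is exact for polynomial integrands up to degree 15, which is why such quadratures are so much cheaper than Monte Carlo sampling of the uncertain parameters.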


2018 ◽  
Vol 1 (2) ◽  
pp. 83-91
Author(s):  
M. Hasyim Siregar

In today's competitive business world, companies must continually develop to survive the competition. To achieve this, a few things can be done: improve product quality, add product types, and reduce the company's operational costs by analyzing the company's data. Data mining is a technology that automates the process of finding interesting and significant patterns in large data sets; it combines human insight into such patterns with scalable techniques. The Adi Bangunan store is a shop engaged in the sale of building materials and household goods that operates like a supermarket: buyers themselves pick up the goods they intend to purchase. Data on sales, purchases of goods, and returns are not well ordered, so the data serve only as an archive for the store and cannot be used to develop a marketing strategy. In this research, data mining is applied using the K-Means process model, which provides a standard process for applying data mining in various areas; it is used for classification because the results of this method can be easily understood and interpreted.
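The K-Means grouping described above can be reproduced in a few lines. The product features below are entirely hypothetical, and scikit-learn's KMeans stands in for whatever implementation the study used:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-product features for a building-materials store:
# [units sold per month, profit margin]. All values are made up.
sales = np.array([[120, 0.10], [115, 0.12], [130, 0.09],   # fast movers
                  [10,  0.40], [8,   0.45], [12,  0.38]])  # slow, high margin
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sales)
print(km.labels_)   # cluster assignment for each product
```

The resulting groups (for example, fast-moving low-margin products versus slow-moving high-margin ones) are exactly the kind of easily interpreted classification the abstract argues can feed a marketing strategy.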


Author(s):  
Awatif Karim ◽  
Chakir Loqman ◽  
Youssef Hami ◽  
Jaouad Boumhidi

In this paper, we propose a new approach to solving document clustering using the K-Means algorithm. The latter is sensitive to the random selection of the k cluster centroids in the initialization phase. To improve the quality of K-Means clustering, we propose to model the text document clustering problem as the max stable set problem (MSSP) and to use a continuous Hopfield network to solve the MSSP and obtain the initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle: MSSP consists of finding the largest set of nodes completely disconnected in a graph, and in clustering, all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient for large data sets, is highly optimized in terms of time, and provides better quality of clustering than other methods.
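The intuition of seeding K-Means with "completely disconnected" (mutually dissimilar) objects can be sketched with a simple greedy farthest-point heuristic. This is only a stand-in for KM_MSSP, which instead solves the MSSP with a continuous Hopfield network:

```python
import numpy as np

def farthest_point_seeds(X, k):
    """Greedy farthest-point seeding: pick k points that are mutually far
    apart, so each initial centroid starts in a different region of the
    data. (Simple stand-in for the MSSP-based centroid selection.)"""
    seeds = [0]                                    # start from the first point
    d = np.linalg.norm(X - X[0], axis=1)           # distance to nearest seed
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                    # farthest from all seeds
        seeds.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return seeds

# three tight groups on a line; with k=3 the seeds cover all three groups
X = np.array([[0.0, 0], [0.1, 0], [10.0, 0], [10.1, 0], [20.0, 0], [20.1, 0]])
print(farthest_point_seeds(X, 3))
```

Seeding from mutually distant points avoids the failure mode where two random centroids land in the same cluster, which is the sensitivity the paper's initialization is designed to remove.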

