Embedding to Reference t-SNE Space Addresses Batch Effects in Single-Cell Classification

10.1101/671404 ◽

2019 ◽

Cited By ~ 2

Author(s):

Pavlin G. Poličar ◽

Martin Stražar ◽

Blaž Zupan

Keyword(s):

Single Cell ◽

Secondary Data ◽

Primary Data ◽

Data Sets ◽

Batch Effects ◽

Data Set ◽

Reduction Techniques ◽

Straightforward Application ◽

Cell Gene Expression ◽

Multiple Data Sets

Abstract. Dimensionality reduction techniques, such as t-SNE, can construct informative visualizations of high-dimensional data. When working with multiple data sets, a straightforward application of these methods often fails; instead of revealing underlying classes, the resulting visualizations expose data set-specific clusters. To circumvent these batch effects, we propose an embedding procedure that takes a t-SNE visualization constructed on a reference data set and uses it as a scaffold for embedding new data. The new, secondary data is embedded one data point at a time. This prevents any interactions between instances in the secondary data and implicitly mitigates batch effects. We demonstrate the utility of this approach with an analysis of six recently published single-cell gene expression data sets containing up to tens of thousands of cells and thousands of genes. In these data sets, the batch effects are particularly strong as the data comes from different institutions and was obtained using different experimental protocols. The visualizations constructed by our proposed approach are free of batch effects, and the cells from secondary data sets correctly co-cluster with cells of the same type from the primary data.

Embedding to reference t-SNE space addresses batch effects in single-cell classification

Machine Learning ◽

10.1007/s10994-021-06043-1 ◽

2021 ◽

Author(s):

Pavlin G. Poličar ◽

Martin Stražar ◽

Blaž Zupan

Keyword(s):

Single Cell ◽

Single Cells ◽

Secondary Data ◽

Machine Learning Techniques ◽

Primary Data ◽

Data Sets ◽

Batch Effects ◽

Data Set ◽

Straightforward Application ◽

Multiple Data Sets

Abstract. Dimensionality reduction techniques, such as t-SNE, can construct informative visualizations of high-dimensional data. When jointly visualizing multiple data sets, a straightforward application of these methods often fails; instead of revealing underlying classes, the resulting visualizations expose data set-specific clusters. To circumvent these batch effects, we propose an embedding procedure that uses a t-SNE visualization constructed on a reference data set as a scaffold for embedding new data points. Each data instance from a new, unseen, secondary data set is embedded independently and does not change the reference embedding. This prevents any interactions between instances in the secondary data and implicitly mitigates batch effects. We demonstrate the utility of this approach by analyzing six recently published single-cell gene expression data sets with up to tens of thousands of cells and thousands of genes. The batch effects in our studies are particularly strong as the data comes from different institutions using different experimental protocols. The visualizations constructed by our proposed approach are clear of batch effects, and the cells from secondary data sets correctly co-cluster with cells of the same type from the primary data. We also show that the predictive power of our simple, visual classification approach in t-SNE space matches the accuracy of specialized machine learning techniques that consider the entire compendium of features that profile single cells.
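
The key property described in the abstract above (each secondary point is placed independently and the reference embedding stays fixed) can be illustrated with a minimal sketch. This is not the authors' procedure: it swaps the t-SNE optimization for a simple inverse-distance weighted average over the k nearest reference points, and the function name `embed_into_reference`, the parameter `k`, and the toy data are assumptions made for illustration only.

```python
import math

def embed_into_reference(ref_points, ref_embedding, new_points, k=3):
    """Place each new point into a fixed reference embedding, one at a time.

    Simplified stand-in for the optimization step: a new point lands at the
    inverse-distance weighted mean of the embedding coordinates of its k
    nearest reference points. New points never interact with each other.
    """
    dim = len(ref_embedding[0])
    placed = []
    for x in new_points:
        # Distances are measured against reference points only.
        nearest = sorted(
            (math.dist(x, r), i) for i, r in enumerate(ref_points)
        )[:k]
        weights = [1.0 / (d + 1e-12) for d, _ in nearest]
        total = sum(weights)
        placed.append(tuple(
            sum(w * ref_embedding[i][c] for w, (_, i) in zip(weights, nearest)) / total
            for c in range(dim)
        ))
    return placed

# Hypothetical reference data: two well-separated classes and their 2-D embedding.
ref_points = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (10, 10, 10), (10, 11, 10), (11, 10, 10)]
ref_embedding = [(-5, 0), (-5, 1), (-4, 0), (5, 0), (5, 1), (6, 0)]
secondary = [(0.2, 0.2, 0.0), (10.2, 10.1, 10.0)]
placed = embed_into_reference(ref_points, ref_embedding, secondary)
```

Because every secondary point is handled in its own loop iteration, embedding a point alone or together with others yields the same coordinates, which is the sense in which interactions within the secondary data are excluded.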

PCQC: Selecting optimal principal components for identifying clusters with highly imbalanced class sizes in single-cell RNA-seq data

10.1101/2020.11.19.390542 ◽

2020 ◽

Author(s):

David Burstein ◽

John F. Fullard ◽

Panos Roussos

Keyword(s):

Single Cell ◽

Principal Components ◽

Data Sets ◽

Sequencing Data ◽

Computationally Efficient ◽

Data Set ◽

Cell Gene Expression ◽

Efficient Alternative ◽

Small Clusters ◽

Variance Explained

Abstract. Summary: Prior to identifying clusters in single cell gene expression experiments, selecting the top principal components is a critical step for filtering out noise in the data set. Identifying these top principal components typically focuses on the total variance explained, and principal components that explain small clusters from rare populations will not necessarily capture a large percentage of variance in the data. We present a computationally efficient alternative for identifying the optimal principal components based on the tails of the distribution of variance explained for each observation. We then evaluate the efficacy of our approach in three different single cell RNA-sequencing data sets and find that our method matches, or outperforms, other selection criteria that are typically employed in the literature. Availability and implementation: pcqc is written in Python and available at github.com/RoussosLab/pcqc
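
The contrast drawn in the abstract above, between ranking components by total variance and ranking them by the tail of the per-observation variance explained, can be sketched as follows. This is one possible reading of the idea, not the pcqc implementation: the functions `total_variance` and `tail_scores`, the quantile `q`, and the toy scores are all assumptions. The input is taken to be a matrix of centered principal component scores (rows are cells, columns are components).

```python
def total_variance(scores):
    """Sum of squared scores per component (the usual ranking criterion)."""
    n_comp = len(scores[0])
    return [sum(row[j] ** 2 for row in scores) for j in range(n_comp)]

def tail_scores(scores, q=0.98):
    """High quantile, per component, of the fraction of each cell's variance it explains."""
    n_comp = len(scores[0])
    out = []
    for j in range(n_comp):
        fractions = []
        for row in scores:
            total = sum(s * s for s in row)
            fractions.append(row[j] ** 2 / total if total > 0 else 0.0)
        fractions.sort()
        idx = min(len(fractions) - 1, int(q * len(fractions)))
        out.append(fractions[idx])
    return out

# Hypothetical scores: component 0 is dominant, component 1 is diffuse noise,
# component 2 matters only for three rare cells.
common = [(1.0 if i % 2 else -1.0, 0.3 if i % 3 else -0.3, 0.05) for i in range(97)]
rare = [(0.1, 0.3, 1.5)] * 3
scores = common + rare

tv = total_variance(scores)
ts = tail_scores(scores)
```

In this toy example the rare-population component has the smallest total variance, yet its tail score is far higher than that of the noise component, so a tail-based criterion would retain it.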

Panoramic stitching of heterogeneous single-cell transcriptomic data

10.1101/371179 ◽

2018 ◽

Cited By ~ 17

Author(s):

Brian Hie ◽

Bryan Bryson ◽

Bonnie Berger

Keyword(s):

Single Cell ◽

Cell Types ◽

Data Sets ◽

Cell Type ◽

Data Set ◽

Wide Range ◽

Data Set Integration ◽

Biological Patterns ◽

Insight Into ◽

Comprehensive Reference

Abstract. Researchers are generating single-cell RNA sequencing (scRNA-seq) profiles of diverse biological systems [1–4] and every cell type in the human body [5]. Leveraging this data to gain unprecedented insight into biology and disease will require assembling heterogeneous cell populations across multiple experiments, laboratories, and technologies. Although methods for scRNA-seq data integration exist [6,7], they often naively merge data sets together even when the data sets have no cell types in common, leading to results that do not correspond to real biological patterns. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration. Our strategy identifies and merges the shared cell types among all pairs of data sets and is orders of magnitude faster than existing techniques. We use Scanorama to combine 105,476 cells from 26 diverse scRNA-seq experiments across 9 different technologies into a single comprehensive reference, demonstrating how Scanorama can be used to obtain a more complete picture of cellular function across a wide range of scRNA-seq experiments.

Multiple Data Set ILI for Mechanical Damage Assessment

Volume 2: Pipeline Integrity Management ◽

10.1115/ipc2012-90244 ◽

2012 ◽

Author(s):

Chris Goller ◽

James Simek ◽

Jed Ludlow

Keyword(s):

Damage Assessment ◽

Public Awareness ◽

Mechanical Damage ◽

Data Sets ◽

Data Set ◽

Multiple Data ◽

Pipe Joints ◽

Multiple Data Sets ◽

Multiple Field ◽

Flux Leakage

The purpose of this paper is to present a non-traditional pipeline mechanical damage ranking system using multiple-data-set in-line inspection (ILI) tools. Mechanical damage continues to be a major factor in reportable incidents for hazardous liquid and gas pipelines. While several ongoing programs seek to limit damage incidents through public awareness, encroachment monitoring, and one-call systems, others have focused efforts on the quantification of mechanical damage severity through modeling, the use of ILI tools, and subsequent feature assessment at locations selected for excavation. Current-generation ILI tools capable of acquiring multiple data sets in a single survey may provide an improved assessment of the severity of damaged zones using methods developed in earlier research programs as well as currently reported information. For magnetic flux leakage (MFL) type tools, using multiple field levels, varied field directions, and high-accuracy deformation sensors enables detection and provides the data necessary for enhanced severity assessments. This paper will provide a review of multiple-data-set ILI results from several pipe joints with simulated mechanical damage locations created to mimic right-of-way encroachment events, in addition to field results from ILI surveys using multiple-data-set tools.

Improving Quality and Safety Through Use of Secondary Data: Methods Case Study

Western Journal of Nursing Research ◽

10.1177/0193945916672449 ◽

2016 ◽

Vol 39 (11) ◽

pp. 1477-1501 ◽

Cited By ~ 1

Author(s):

Victoria Goode ◽

Nancy Crego ◽

Michael P. Cary ◽

Deirdre Thornlow ◽

Elizabeth Merwin

Keyword(s):

Research Question ◽

Secondary Data ◽

Data Access ◽

Data Sets ◽

Complex Data ◽

Management Skills ◽

Data Set ◽

Large Numbers ◽

Need To Evaluate

Researchers need to evaluate the strengths and weaknesses of data sets to choose a secondary data set to use for a health care study. This research method review informs the reader of the major issues necessary for investigators to consider while incorporating secondary data into their repertoire of potential research designs and shows the range of approaches the investigators may take to answer nursing research questions in a variety of context areas. The researcher requires expertise in locating and judging data sets and in the development of complex data management skills for managing large numbers of records. There are important considerations such as firm knowledge of the research question supported by the conceptual framework and the selection of appropriate databases, which guide the researcher in delineating the unit of analysis. Other more complex issues for researchers to consider when conducting secondary data research methods include data access, management and security, and complex variable construction.

Data Fusion Using a Multi-Sensor Sparse-Based Clustering Algorithm

Remote Sensing ◽

10.3390/rs12234007 ◽

2020 ◽

Vol 12 (23) ◽

pp. 4007

Author(s):

Kasra Rafiezadeh Shahi ◽

Pedram Ghamisi ◽

Behnood Rasti ◽

Robert Jackisch ◽

Paul Scheunders ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Information ◽

Clustering Algorithms ◽

Hyperspectral Data ◽

Sensor Data ◽

Data Sets ◽

Data Types ◽

Data Set ◽

Multiple Data Sets ◽

Imaging Sensors

The increasing amount of information acquired by imaging sensors in Earth Sciences results in the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) for monitoring of the Earth’s surface. Many studies were devoted to investigating the usage of multi-sensor data sets in the performance of supervised learning-based approaches at various tasks (i.e., classification and regression) while unsupervised learning-based approaches have received less attention. In this paper, we propose a new approach to fuse multiple data sets from imaging sensors using a multi-sensor sparse-based clustering algorithm (Multi-SSC). A technique for the extraction of spatial features (i.e., morphological profiles (MPs) and invariant attribute profiles (IAPs)) is applied to high spatial-resolution data to derive the spatial and contextual information. This information is then fused with spectrally rich data such as multi- or hyperspectral data. In order to fuse multi-sensor data sets a hierarchical sparse subspace clustering approach is employed. More specifically, a lasso-based binary algorithm is used to fuse the spectral and spatial information prior to automatic clustering. The proposed framework ensures that the generated clustering map is smooth and preserves the spatial structures of the scene. In order to evaluate the generalization capability of the proposed approach, we investigate its performance not only on diverse scenes but also on different sensors and data types. The first two data sets are geological data sets, which consist of hyperspectral and RGB data. The third data set is the well-known benchmark Trento data set, including hyperspectral and LiDAR data. Experimental results indicate that this novel multi-sensor clustering algorithm can provide an accurate clustering map compared to the state-of-the-art sparse subspace-based clustering algorithms.

Estimating observation and model error variances using multiple data sets

Atmospheric Measurement Techniques ◽

10.5194/amt-11-4239-2018 ◽

2018 ◽

Vol 11 (7) ◽

pp. 4239-4260 ◽

Cited By ~ 8

Author(s):

Richard Anthes ◽

Therese Rieckh

Keyword(s):

Error Variance ◽

Specific Humidity ◽

Data Sets ◽

Data Set ◽

Multiple Data ◽

Gfs Model ◽

Multiple Data Sets ◽

Using Data ◽

The Tropics ◽

Estimated Error

Abstract. In this paper we show how multiple data sets, including observations and models, can be combined using the “three-cornered hat” (3CH) method to estimate vertical profiles of the errors of each system. Using data from 2007, we estimate the error variances of radio occultation (RO), radiosondes, ERA-Interim, and Global Forecast System (GFS) model data sets at four radiosonde locations in the tropics and subtropics. A key assumption is the neglect of error covariances among the different data sets, and we examine the consequences of this assumption on the resulting error estimates. Our results show that different combinations of the four data sets yield similar relative and specific humidity, temperature, and refractivity error variance profiles at the four stations, and these estimates are consistent with previous estimates where available. These results thus indicate that the correlations of the errors among all data sets are small and the 3CH method yields realistic error variance profiles. The estimated error variances of the ERA-Interim data set are smallest, a reasonable result considering the excellent model and data assimilation system and assimilation of high-quality observations. For the four locations studied, RO has smaller error variances than radiosondes, in agreement with previous studies. Part of the larger error variance of the radiosondes is associated with representativeness differences because radiosondes are point measurements, while the other data sets represent horizontal averages over scales of ∼ 100 km.
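
The classic three-cornered hat estimate mentioned in the abstract above can be sketched in a few lines. Under the stated assumption that error covariances among the data sets are negligible, the error variance of X follows from the variances of pairwise differences: Var_err(X) = 0.5 * [Var(X - Y) + Var(X - Z) - Var(Y - Z)], and likewise for Y and Z. The function name `three_cornered_hat` and the toy series are assumptions for illustration; the paper's exact estimator may differ in detail.

```python
from statistics import pvariance

def three_cornered_hat(x, y, z):
    """Estimate the error variance of three collocated series of the same quantity.

    Assumes the errors of the three systems are mutually uncorrelated.
    """
    def var_diff(a, b):
        return pvariance([ai - bi for ai, bi in zip(a, b)])

    vxy, vxz, vyz = var_diff(x, y), var_diff(x, z), var_diff(y, z)
    return (
        0.5 * (vxy + vxz - vyz),  # error variance of x
        0.5 * (vxy + vyz - vxz),  # error variance of y
        0.5 * (vxz + vyz - vxy),  # error variance of z
    )

# Hypothetical truth plus zero-mean, mutually orthogonal error patterns.
truth = [10.0, 12.0, 9.0, 11.0]
err_x = [0.5, -0.5, 0.5, -0.5]   # variance 0.25
err_y = [1.0, 1.0, -1.0, -1.0]   # variance 1.0
err_z = [2.0, -2.0, -2.0, 2.0]   # variance 4.0
x = [t + e for t, e in zip(truth, err_x)]
y = [t + e for t, e in zip(truth, err_y)]
z = [t + e for t, e in zip(truth, err_z)]

est_x, est_y, est_z = three_cornered_hat(x, y, z)
```

The unknown truth cancels in every pairwise difference, which is why no reference "true" profile is needed; if the errors are in fact correlated, the neglected covariance terms bias the estimates.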

Moisture estimation within a mine heap: An application of cokriging with assay data and electrical resistivity

Geophysics ◽

10.1190/1.3277266 ◽

2010 ◽

Vol 75 (1) ◽

pp. B11-B23 ◽

Cited By ~ 9

Author(s):

Dale Rucker

Keyword(s):

Electrical Resistivity ◽

Secondary Data ◽

Estimation Procedure ◽

Primary Data ◽

Low Grade ◽

Least Squares Regression ◽

Data Set ◽

Correlation Scale ◽

The Mean ◽

Moisture Estimation

Cokriging has been applied to estimate the distribution of moisture within a rock pile of low-grade gold ore, or heap. Along with the primary data set of gravimetric moisture content obtained from drilling, electrical resistivity was used to supplement the estimation procedure by supplying a secondary data set. The effectiveness of the cokriging method was determined by comparing the results to kriging the moisture data alone and through least-squares regression (LSR) modeling of colocated resistivity and moisture. In general, the wells from which moisture data were derived were separated by distances far greater than the horizontal correlation scale. The kriging results showed that regions generally undersampled by drilling reverted to the mean of the moisture data. The LSR technique, which provides a simple transformation of resistivity to moisture, converted the low resistivity to high moisture, and vice versa. The sparse well locations created a high degree of uncertainty in the transformed data set. Extreme resistivity values produced nonphysical moisture values, either negative for the linear model or values greater than one for the power model. The cokriging application, which considers the correlation scale and secondary data, produced the best results, as indicated through the cross validation. The mean and variance of the cokriged moisture were closer to the measured moisture, and the bias in the residuals was the lowest. The application likely could be improved through optimal well placement, whereby the resistivity results guide the drilling program through gross target characterization, and the moisture estimation could be updated iteratively.
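
The failure mode of the linear LSR model noted in the abstract above (negative moisture at extreme resistivity) is easy to reproduce with a toy fit. The numbers below are invented for illustration and are not from the study; `fit_line` and `predict` are hypothetical helper names, and the sketch covers only ordinary least squares, not kriging or cokriging.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(slope, intercept, x):
    return intercept + slope * x

# Hypothetical colocated samples: resistivity (ohm-m) and moisture (mass fraction).
resistivity = [10.0, 20.0, 30.0, 40.0, 50.0]
moisture = [0.20, 0.17, 0.14, 0.11, 0.08]

slope, intercept = fit_line(resistivity, moisture)
in_range = predict(slope, intercept, 30.0)    # inside the sampled range
extreme = predict(slope, intercept, 100.0)    # far outside the sampled range
```

Within the sampled range the fit behaves sensibly (low resistivity maps to high moisture), but the unbounded straight line extrapolates to a negative moisture fraction at the extreme value.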

Probabilistic Harmonization and Annotation of Single-cell Transcriptomics Data with Deep Generative Models

10.1101/532895 ◽

2019 ◽

Cited By ~ 14

Author(s):

Chenling Xu ◽

Romain Lopez ◽

Edouard Mehlman ◽

Jeffrey Regier ◽

Michael I. Jordan ◽

...

Keyword(s):

Single Cell ◽

Probabilistic Approach ◽

Cell Types ◽

Generative Models ◽

Marker Genes ◽

Data Sets ◽

Data Set ◽

Cell State ◽

Transcriptomics Data ◽

Single Data

Abstract. As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward how to compare gene expression levels across data sets and how to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations, for instance when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to the existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated data sets with a single generative model that can be directly used for any probabilistic decision-making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.

CIM-seq

10.21203/rs.3.pex-1365/v1 ◽

2021 ◽

Author(s):

Nathanael Andrews ◽

Martin Enge

Keyword(s):

Single Cell ◽

Single Cells ◽

Likelihood Estimation ◽

Cell Types ◽

Data Sets ◽

Target Tissue ◽

Data Set ◽

Rnaseq Data ◽

The Given ◽

Cell Data

Abstract. CIM-seq is a tool for deconvoluting RNA-seq data from cell multiplets (clusters of two or more cells) in order to identify physically interacting cells in a given tissue. The method requires two RNA-seq data sets from the same tissue: one of single cells to be used as a reference, and one of cell multiplets to be deconvoluted. CIM-seq is compatible with both droplet-based sequencing methods, such as Chromium Single Cell 3′ Kits from 10x Genomics, and plate-based methods, such as Smart-seq2. The pipeline consists of three parts: 1) dissociation of the target tissue, FACS sorting of single cells and multiplets, and conventional scRNA-seq; 2) feature selection and clustering of cell types in the single cell data set, generating a blueprint of transcriptional profiles in the given tissue; 3) computational deconvolution of multiplets through maximum likelihood estimation (MLE) to determine the most likely cell type constituents of each multiplet.
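
The third step above, choosing the most likely cell type constituents of a multiplet by maximum likelihood, can be illustrated with a deliberately small sketch. It is not the CIM-seq implementation: it assumes doublets only, a Poisson count model, and strictly positive reference profiles, and the names `poisson_loglik`, `most_likely_pair`, and the toy profiles are hypothetical.

```python
import math
from itertools import combinations

def poisson_loglik(counts, rates):
    """Log-likelihood of observed counts under independent Poisson rates."""
    return sum(
        k * math.log(lam) - lam - math.lgamma(k + 1)
        for k, lam in zip(counts, rates)
    )

def most_likely_pair(multiplet, profiles):
    """Return the pair of cell types whose summed profile best explains the multiplet."""
    best_pair, best_ll = None, -math.inf
    for a, b in combinations(sorted(profiles), 2):
        rates = [pa + pb for pa, pb in zip(profiles[a], profiles[b])]
        ll = poisson_loglik(multiplet, rates)
        if ll > best_ll:
            best_pair, best_ll = (a, b), ll
    return best_pair

# Hypothetical mean expression per cell type over three marker genes,
# as would be derived from the single-cell reference data set.
profiles = {
    "A": [10.0, 1.0, 1.0],
    "B": [1.0, 10.0, 1.0],
    "C": [1.0, 1.0, 10.0],
}
pair_clean = most_likely_pair([11, 2, 11], profiles)
pair_noisy = most_likely_pair([9, 3, 12], profiles)
```

An exhaustive search over pairs is workable only for a handful of cell types; larger multiplets or fractional contributions would call for a proper optimizer over mixing weights.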
