Exploratory Only: A Tool for Large-Scale Exploratory Analyses

Mapping Intimacies ◽

10.31234/osf.io/37fvm ◽

2021 ◽

Author(s):

Jin Kim

Keyword(s):

Web Application ◽

Large Scale ◽

Behavioral Science ◽

A Priori ◽

R Package ◽

Data Set ◽

Mediation Analyses ◽

Follow Up Studies ◽

Minimal Effort

This article presents Exploratory Only: an intuitive tool for conducting large-scale exploratory analyses easily and quickly. Available in three forms (as a web application, standalone program, and R Package) and launched as a point-and-click interface, Exploratory Only allows researchers to conduct all possible correlation, moderation, and mediation analyses among selected variables in their data set with minimal effort and time. Compared to a popular alternative, SPSS, Exploratory Only is shown to be orders of magnitude easier and faster at conducting exploratory analyses. The article demonstrates how to use Exploratory Only and discusses the caveat to using it. As long as researchers use Exploratory Only as intended—to discover novel hypotheses to investigate in follow-up studies, rather than to confirm nonexistent a priori hypotheses (i.e., p-hacking)—Exploratory Only can promote progress in behavioral science by encouraging more exploratory analyses and therefore more discoveries.
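
The "all possible correlation analyses among selected variables" idea can be illustrated with a short sketch. This is not the Exploratory Only interface or its R code. The function names `pearson_r` and `all_pairwise_correlations` and the toy data are hypothetical, and only the correlation step is shown (moderation and mediation are omitted).

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def all_pairwise_correlations(data):
    """Correlate every pair of variables; strongest |r| first.

    `data` maps a variable name to its list of observations
    (hypothetical layout, chosen for illustration only).
    """
    results = [
        (a, b, pearson_r(data[a], data[b]))
        for a, b in combinations(data, 2)
    ]
    return sorted(results, key=lambda row: abs(row[2]), reverse=True)


# Toy data set with three made-up variables.
data = {
    "hours_slept": [6, 7, 8, 5, 9],
    "mood": [3, 4, 5, 2, 5],
    "caffeine_mg": [200, 150, 100, 250, 50],
}

ranked = all_pairwise_correlations(data)
for name_a, name_b, r in ranked:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

With k selected variables this runs k(k-1)/2 tests, so the ranked output is best read as a list of candidate hypotheses for follow-up studies, in line with the caveat stated in the abstract.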

How to care for the children? The need for large scale follow-up studies

Human Reproduction ◽

10.1093/humrep/13.9.2347 ◽

1998 ◽

Vol 13 (9) ◽

pp. 2347-2349 ◽

Cited By ~ 5

Author(s):

A. Brewaeys

Keyword(s):

Large Scale ◽

Follow Up Studies

Joint DNA-based Disaster Victim Identification

10.21203/rs.3.rs-296414/v1 ◽

2021 ◽

Author(s):

Magnus Dehli Vigeland ◽

Thore Egeland

Keyword(s):

A Priori ◽

Search Space ◽

R Package ◽

Individual Identification ◽

Disaster Victim Identification ◽

Data Set ◽

Victim Identification ◽

Joint Identification ◽

Available Information ◽

User Friendly

Abstract: We address computational and statistical aspects of DNA-based identification of victims in the aftermath of disasters. Current methods and software for such identification typically consider each victim individually, leading to suboptimal power of identification and potential inconsistencies in the statistical summary of the evidence. We resolve these problems by performing joint identification of all victims, using the complete genetic data set. Individual identification probabilities, conditional on all available information, are derived from the joint solution in the form of posterior pairing probabilities. A closed formula is obtained for the a priori number of possible joint solutions to a given DVI problem. This number increases quickly with the number of victims and missing persons, posing computational challenges for brute force approaches. We address this complexity with a preparatory sequential step aiming to reduce the search space. The examples show that realistic cases are handled efficiently. User-friendly implementations of all methods are provided in the R package dvir, freely available on all platforms.
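
To give a feel for how quickly the number of joint solutions grows, here is a minimal counting sketch under a simplifying assumption: each victim is either paired with a distinct missing person or left unidentified, with no further constraints. This is not the closed formula from the paper, which may account for additional constraints, and the function name `count_joint_solutions` is hypothetical.

```python
from math import comb, factorial


def count_joint_solutions(n_victims, n_missing):
    """Count assignments of victims to missing persons.

    Simplifying assumption (not taken from the paper): a solution picks
    k victims, k missing persons, and one of k! ways to pair them, for
    every k from 0 up to min(n_victims, n_missing).
    """
    return sum(
        comb(n_victims, k) * comb(n_missing, k) * factorial(k)
        for k in range(min(n_victims, n_missing) + 1)
    )


# The count grows quickly as both groups get larger.
for n in (1, 2, 3, 5, 10):
    print(n, count_joint_solutions(n, n))
```

Even under this simplified model the count grows fast enough that enumerating every assignment soon becomes impractical, which is the motivation the abstract gives for reducing the search space first.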

A geostatistical approach to multisensor rain field reconstruction and downscaling

Hydrology and Earth System Sciences ◽

10.5194/hess-5-201-2001 ◽

2001 ◽

Vol 5 (2) ◽

pp. 201-213 ◽

Cited By ~ 9

Author(s):

P. Fiorucci ◽

P. La Barbera ◽

L.G. Lanza ◽

R. Minciardi

Keyword(s):

Large Scale ◽

A Priori ◽

Grid Cell ◽

Ground Truth ◽

Rain Gauge ◽

Rain Event ◽

Data Set ◽

Field Reconstruction ◽

Geostatistical Approach ◽

A Posteriori Estimates

Abstract. A rain field reconstruction and downscaling methodology is presented, which allows suitable integration of large scale rainfall information and rain-gauge measurements at the ground. The former data set is assumed to provide probabilistic indicators that are used to infer the parameters of the probability density function of the stochastic rain process at each pixel site. Rain-gauge measurements are assumed as the ground truth and used to constrain the reconstructed rain field to the associated point values. Downscaling is performed by assuming the a posteriori estimates of the rain figures at each grid cell as the a priori large-scale conditioning values for reconstruction of the rain field at finer scale. The case study of an intense rain event recently observed in northern Italy is presented and results are discussed with reference to the modelling capabilities of the proposed methodology. Keywords: Reconstruction, downscaling, remote sensing, geostatistics, Meteosat

HCsnip: An R Package for Semi-Supervised Snipping of the Hierarchical Clustering Tree

Cancer Informatics ◽

10.4137/cin.s22080 ◽

2015 ◽

Vol 14 ◽

pp. CIN.S22080 ◽

Cited By ~ 1

Author(s):

Askar Obulkasim ◽

Mark A. Van De Wiel

Keyword(s):

Hierarchical Clustering ◽

R Package ◽

Extraction Process ◽

Background Information ◽

High Dimensional ◽

Data Set ◽

Event Information ◽

Data Points ◽

Clustering Tree

Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes the various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that have “haunted” high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic, and can be combined with other algorithms that operate on detected clusters.
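
The conventional fixed-height cut that the abstract contrasts with can be sketched as follows. This is not HCsnip and does not show its semi-supervised snipping. The sketch assumes single-linkage clustering of one-dimensional points, where cutting the tree at a given height is equivalent to grouping points connected by chains of distances no greater than that height. The function name `fixed_height_cut` and the data are hypothetical.

```python
def fixed_height_cut(points, height):
    """Clusters from cutting a single-linkage tree of 1-D points at `height`.

    Assumption: single linkage, so two points share a cluster when a chain
    of pairwise distances <= height connects them (union-find below).
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= height:
                parent[find(i)] = find(j)

    clusters = {}
    for i, value in enumerate(points):
        clusters.setdefault(find(i), []).append(value)
    return sorted(sorted(members) for members in clusters.values())


points = [1.0, 1.2, 1.3, 5.0, 5.1, 9.0]
print(fixed_height_cut(points, 0.5))  # a low cut keeps tight groups apart
print(fixed_height_cut(points, 4.0))  # a high cut merges them
```

A single global height applies the same threshold to every branch, which is the limitation the abstract describes for trees with nested clusters.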

Multi-resolution characterization of molecular taxonomies in bulk and single-cell transcriptomics data

10.1101/2020.11.05.370197 ◽

2020 ◽

Author(s):

Eric R. Reed ◽

Stefano Monti

Keyword(s):

Breast Cancer ◽

Single Cell ◽

Large Scale ◽

Recursive Partitioning ◽

A Priori ◽

Cost Effective ◽

R Package ◽

Cancer Tissue ◽

Data Types ◽

Transcriptomics Data

Abstract: As high-throughput genomics assays become more efficient and cost effective, their utilization has become standard in large-scale biomedical projects. These studies are often explorative, in that relationships between samples are not explicitly defined a priori, but rather emerge from data-driven discovery and annotation of molecular subtypes, thereby informing hypotheses and independent evaluation. Here, we present K2Taxonomer, a novel unsupervised recursive partitioning algorithm and associated R package that utilize ensemble learning to identify robust subgroups in a “taxonomy-like” structure (https://github.com/montilab/K2Taxonomer). K2Taxonomer was devised to accommodate different data paradigms, and is suitable for the analysis of both bulk and single-cell transcriptomics data. For each of these data types, we demonstrate the power of K2Taxonomer to discover known relationships in both simulated and human tissue data. We conclude with a practical application on breast cancer tumor-infiltrating lymphocyte (TIL) single-cell profiles, in which we identified co-expression of translational machinery genes as a dominant transcriptional program shared by T cell subtypes, associated with better prognosis in breast cancer tissue bulk expression data.

Envisat MIPAS measurements of CFC-11: retrieval, validation, and climatology

Atmospheric Chemistry and Physics Discussions ◽

10.5194/acpd-8-4561-2008 ◽

2008 ◽

Vol 8 (2) ◽

pp. 4561-4602 ◽

Cited By ~ 1

Author(s):

L. Hoffmann ◽

M. Kaufmann ◽

R. Spang ◽

R. Müller ◽

J. J. Remedios ◽

...

Keyword(s):

Large Scale ◽

Lower Stratosphere ◽

A Priori ◽

Michelson Interferometer ◽

Optimal Estimation ◽

Atmospheric Conditions ◽

Data Set ◽

Mixing Ratios ◽

Priori Information ◽

The Individual

Abstract. From July 2002 to March 2004 the Michelson Interferometer for Passive Atmospheric Sounding (MIPAS) aboard the European Space Agency's Environmental Satellite (Envisat) measured mid-infrared limb radiance spectra nearly continuously. These measurements are utilised to retrieve the global distribution of the chlorofluorocarbon CFC-11 by applying a new fast forward model for Envisat MIPAS and an accompanying optimal estimation retrieval processor. A detailed analysis shows that the total retrieval errors of the individual CFC-11 volume mixing ratios are typically below 10% and that the systematic components dominate. The contribution of a priori information to the retrieval results is less than 5 to 10%. The vertical resolution of the observations is about 3 to 4 km. The data are successfully validated by comparison with several other space experiments, an air-borne in-situ instrument, measurements from ground-based networks, and independent Envisat MIPAS analyses. The retrieval results from 425 000 Envisat MIPAS limb scans are compiled to provide a new climatological data set of CFC-11. The climatology shows significantly lower CFC-11 abundances in the lower stratosphere compared with the Reference Atmospheres for MIPAS (RAMstan V3.1) climatology. Depending on the atmospheric conditions the differences between the climatologies are up to 30 to 110 ppt (45 to 150%) at 19 to 27 km altitude. Additionally, time series of CFC-11 mean abundance and variability for five latitudinal bands are presented. The observed CFC-11 distributions can be explained by the residual mean circulation and large-scale eddy-transports in the upper troposphere and lower stratosphere. The new CFC-11 data set is well suited for further scientific studies.

Follow up studies: a case for a standard minimum data set

Archives of Disease in Childhood - Fetal and Neonatal Edition ◽

10.1136/fn.76.1.f61 ◽

1997 ◽

Vol 76 (1) ◽

pp. F61-F63 ◽

Cited By ~ 22

Author(s):

A. Johnson

Keyword(s):

Minimum Data Set ◽

Data Set ◽

Minimum Data ◽

Follow Up Studies

Disrupted Sense of Agency as a State Marker of First-Episode Schizophrenia: A Large-Scale Follow-Up Study

Frontiers in Psychiatry ◽

10.3389/fpsyt.2020.570570 ◽

2020 ◽

Vol 11 ◽

Author(s):

Eva Kozáková ◽

Eduard Bakštein ◽

Ondřej Havlíček ◽

Ondřej Bečev ◽

Pavel Knytl ◽

...

Keyword(s):

Large Scale ◽

Clinical Symptoms ◽

Early Stage ◽

A Priori ◽

Sense Of Agency ◽

First Episode ◽

Dependent Manner ◽

Self Monitoring ◽

First Episode Schizophrenia

Background: Schizophrenia is often characterized by a general disruption of self-processing and self-demarcation. Previous studies have shown that self-monitoring and sense of agency (SoA, i.e., the ability to recognize one's own actions correctly) are altered in schizophrenia patients. However, research findings are inconclusive with regard to how SoA alterations are linked to clinical symptoms and their severity, or cognitive factors. Methods: In a longitudinal study, we examined 161 first-episode schizophrenia patients and 154 controls with a continuous-report SoA task and a control task testing general cognitive/sensorimotor processes. Clinical symptoms were assessed with the Positive and Negative Syndrome Scale (PANSS). Results: In comparison to controls, patients performed worse in terms of recognition of self-produced movements even when controlling for confounding factors. Patients' SoA score correlated with the severity of PANSS-derived “Disorganized” symptoms and with a priori defined symptoms related to self-disturbances. In the follow-up, the changes in the two subscales were significantly associated with the change in SoA performance. Conclusion: We corroborated previous findings of altered SoA already in the early stage of schizophrenia. Decreased ability to recognize self-produced actions was associated with the severity of symptoms in two complementary domains: self-disturbances and disorganization. While the involvement of the former might indicate impairment in self-monitoring, the latter suggests the role of higher cognitive processes such as information updating or cognitive flexibility. The SoA alterations in schizophrenia are associated, at least partially, with the intensity of respective symptoms in a state-dependent manner.

A rigorous method for integrating multiple heterogeneous databases in genetic studies

10.1101/2020.09.15.298505 ◽

2020 ◽

Author(s):

József Bukszár ◽

Edwin JCG van den Oord

Keyword(s):

Large Scale ◽

Posterior Probability ◽

R Package ◽

Heterogeneous Databases ◽

Data Set ◽

Genetic Studies ◽

Marker Selection ◽

Mathematical Formulas ◽

User Friendly ◽

Existing Data

Abstract: The large number of existing databases provides a freely available independent source of information with a considerable potential to increase the likelihood of identifying genes for complex diseases. We developed a flexible framework for integrating such heterogeneous databases into novel large scale genetic studies and implemented the methods in a freely-available, user-friendly R package called MIND. For each marker, MIND computes the posterior probability that the marker has an effect in the novel data collection based on the information in all available data. MIND 1) relies on a very general model, 2) is based on the mathematical formulas that provide us with the exact value of the posterior probability, and 3) has good estimation properties because of its very efficient parameterization. For an existing data set, only the ranks of the markers are needed, where ties among the ranks are allowed. Through simulations, cross-validation analyses involving 18 GWAS, and an independent replication study of 6,544 SNPs in 6,298 samples we show that MIND 1) is accurate, 2) outperforms marker selection for follow up studies based on p-values, and 3) identifies effects that would otherwise require replication of over 20 times as many markers.

Author Summary: The large number of existing databases provides a freely available independent source of information with a considerable potential to increase the likelihood of identifying genes for complex diseases. We developed a flexible framework for integrating such heterogeneous databases into novel large scale genetic studies and implemented the methods in a freely-available, user-friendly R package called MIND. For each marker, MIND computes an estimate of the (posterior) probability that the marker has an effect in the novel data collection based on the information in all available data. For an existing data set, only the ranks of the markers need to be known, where ties among the ranks are allowed. MIND 1) relies on a realistic model that takes confounding effects into account, 2) is based on the mathematical formulas that provide us with the exact value of the posterior probability, and 3) has good estimation properties because of its very efficient parameterization. Simulation, validation, and a replication study in independent samples show that MIND is accurate and greatly outperforms marker selection without using existing data sets.

Deciding whether follow-up studies have replicated findings in a preliminary large-scale omics study

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1314814111 ◽

2014 ◽

Vol 111 (46) ◽

pp. 16262-16267 ◽

Cited By ~ 28

Author(s):

Ruth Heller ◽

Marina Bogomolov ◽

Yoav Benjamini

Keyword(s):

Large Scale ◽

Follow Up Studies
