BisPin and BFAST-Gap: Mapping Bisulfite-Treated Reads

2018 ◽  
Author(s):  
Jacob Porter ◽  
Liqing Zhang

Abstract

Background: BisPin is a new multiprocess bisulfite-treated short DNA read mapper written in Python 2.7. It performs alignments with BFAST, leveraging BFAST's multithreading and thorough hash-based indexing strategy. BisPin is feature-rich and supports directional, nondirectional, PBAT, and hairpin construction strategies. BisPin approaches read mapping by converting Cs to Ts and Gs to As in both the reads and the reference genome, and it uses fast rescoring to disambiguate ambiguously aligned reads, yielding a greater number of uniquely mapped reads than other mappers. The performance of BisPin was evaluated on both real and simulated data in comparison with other read mappers. BFAST-Gap is a modified version of BFAST intended for Ion Torrent reads. Because Ion Torrent sequencing technology can overcall and undercall homopolymer runs, BFAST-Gap uses a parameterized logistic function of the DNA read's homopolymer run length to determine the weights of the gap open and extension penalties. BisPin works with both BFAST and BFAST-Gap, and BFAST-Gap is compatible with indexes built with BFAST. Few mappers specifically address Ion Torrent data; BFAST-Gap works with Illumina reads as well.

Results: BisPin with BFAST consistently yielded a higher number of uniquely mapped reads than other mappers on real data across a variety of construction strategies. Under a hairpin validation strategy, BisPin was superior using the maximum score, mapping 73% of reads correctly. BisPin with BFAST-Gap and a logistic gap open penalty function improved mapping accuracy on real and simulated Ion Torrent reads. On simulated bisulfite Ion Torrent data, the area under the curve improved by approximately seven, and on one real data set the uniquely mapped percentage improved by seven percent. BFAST-Gap outperformed TMAP on simulated regular Ion Torrent reads, even though TMAP is designed for Ion Torrent reads; other read mappers performed worse.

Conclusions: BisPin and BFAST-Gap have consistently good accuracy on a variety of data, and BisPin is feature-rich, making both useful additions to read mapping software.
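The homopolymer-aware gap penalty described above can be sketched as a parameterized logistic curve that lowers the gap-open cost as a run lengthens. This is an illustrative sketch only; the parameter names and values (max_penalty, min_penalty, midpoint, steepness) are assumptions, not BFAST-Gap's actual defaults.

```python
import math

def logistic_gap_open_penalty(run_length, max_penalty=40.0, min_penalty=10.0,
                              midpoint=4.0, steepness=1.0):
    """Scale the gap-open penalty down as the homopolymer run grows, so
    indels inside long runs (a common Ion Torrent error mode) cost less.
    All parameter values here are hypothetical, for illustration only."""
    drop = (max_penalty - min_penalty) / (
        1.0 + math.exp(-steepness * (run_length - midpoint)))
    return max_penalty - drop
```

Short runs keep a penalty near max_penalty, discouraging spurious gaps; long runs approach min_penalty, tolerating the over- and undercalls the abstract describes.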

2019 ◽  
Vol 488 (4) ◽  
pp. 5232-5250 ◽  
Author(s):  
Alexander Chaushev ◽  
Liam Raynard ◽  
Michael R Goad ◽  
Philipp Eigmüller ◽  
David J Armstrong ◽  
...  

ABSTRACT Vetting of exoplanet candidates in transit surveys is a manual process, which suffers from a large number of false positives and a lack of consistency. Previous work has shown that convolutional neural networks (CNNs) provide an efficient solution to these problems. Here, we apply a CNN to classify planet candidates from the Next Generation Transit Survey (NGTS). For training data sets, we compare both real data with injected planetary transits and fully simulated data, as well as how their different compositions affect network performance. We show that fewer hand-labelled light curves can be utilized while still achieving competitive results. With our best model, we achieve an area under the curve (AUC) score of (95.6 ± 0.2) per cent and an accuracy of (88.5 ± 0.3) per cent on our unseen test data, as well as (76.5 ± 0.4) per cent and (74.6 ± 1.1) per cent in comparison to our existing manual classifications. The neural network recovers, with high probability, 13 out of 14 confirmed planets observed by NGTS. We use simulated data to show that the overall network performance is resilient to mislabelling of the training data set, a problem that might arise due to unidentified, low signal-to-noise transits. Using a CNN, the time required for vetting can be reduced by half while still recovering the vast majority of manually flagged candidates. In addition, we identify many new high-probability candidates that were not flagged by human vetters.
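The AUC score reported above has a simple rank interpretation: it is the probability that a randomly chosen true candidate outranks a randomly chosen false positive. A minimal sketch of that computation (Mann-Whitney form, not the NGTS pipeline's actual evaluation code):

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    fraction of (positive, negative) pairs where the positive scores
    higher, counting ties as half a win. O(P*N), fine for small sets."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every planet above every false positive scores 1.0; random scoring gives 0.5.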


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a trait latent variable measured by ordinal items. Items are nested within a time-point, and time-points are nested within subjects. These models are special because they include a factor analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses were included: an "autoregressive error degree one" structure for the trait residuals and estimated time-scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.


2020 ◽  
Author(s):  
Edlin J. Guerra-Castro ◽  
Juan Carlos Cajas ◽  
Nuno Simões ◽  
Juan J Cruz-Motta ◽  
Maite Mascaró

ABSTRACT SSP (simulation-based sampling protocol) is an R package that uses simulation of ecological data and the dissimilarity-based multivariate standard error (MultSE) as an estimator of precision to evaluate the adequacy of different sampling efforts for studies that will test hypotheses using permutational multivariate analysis of variance. The procedure consists of simulating several extensive data matrices that mimic some of the relevant ecological features of the community of interest using a pilot data set. For each simulated data set, several sampling efforts are repeatedly executed and MultSE is calculated. The mean value and the 0.025 and 0.975 quantiles of MultSE for each sampling effort across all simulated data sets are then estimated and standardized against the lowest sampling effort. The optimal sampling effort is identified as that at which an increase in sampling effort does not improve precision beyond a threshold value (e.g., 2.5%). The performance of SSP was validated using real data; in all examples the simulated data mimicked the real data well, allowing the MultSE-n relationship to be evaluated beyond the sample size of the pilot studies. SSP can be used to estimate sample size in a wide range of situations, from simple (e.g., a single site) to more complex (e.g., several sites across different habitats) experimental designs. The latter constitutes an important advantage, since it offers new possibilities for complex sampling designs, as has been advised for multi-scale studies in ecology.
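The MultSE estimator at the heart of the protocol can be sketched directly from pairwise dissimilarities. This follows the pseudo-variance formulation commonly attributed to Anderson and Santana-Garcon (V = sum of squared dissimilarities / (n(n-1)), MultSE = sqrt(V/n)); it is a minimal sketch, not SSP's actual R implementation.

```python
import math

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def mult_se(dissimilarities, n):
    """Dissimilarity-based multivariate standard error for a sample of n
    units. `dissimilarities` holds the n*(n-1)/2 pairwise values (e.g.
    Bray-Curtis). Pseudo-variance V = sum(d^2) / (n*(n-1)); the standard
    error of the multivariate centroid is sqrt(V / n)."""
    v = sum(d * d for d in dissimilarities) / (n * (n - 1))
    return math.sqrt(v / n)
```

Repeating this over resampled subsets of increasing n, and tracking the mean and 0.025/0.975 quantiles, reproduces the MultSE-n curve the abstract describes.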


2020 ◽  
Vol 8 (2) ◽  
pp. B35-B43
Author(s):  
Julio Cesar S. O. Lyrio ◽  
Paulo T. L. Menezes ◽  
Jorlivan L. Correa ◽  
Adriano R. Viana

When collecting and processing geophysical data for exploration, the same geologic feature can generate a different response for each rock property being targeted. Typically, the units of these responses differ by several orders of magnitude; therefore, combining geophysical data in integrated interpretation is not a straightforward process and cannot be performed by visual inspection alone. The multiphysics anomaly map (MAM) that we have developed is a data fusion solution consisting of a spatial representation of the correlation between anomalies detected with different geophysical methods. In the MAM, we mathematically process geophysical data such as seismic attributes, gravity, magnetic, and resistivity before combining them in a single map. In each data set, anomalous regions of interest, which are problem-dependent, are selected by the interpreter. Selected anomalies are highlighted through a logistic function, specially designed to clip large magnitudes and rescale the range of values, increasing the discrimination of anomalies. The resulting anomalies, named logistic anomalies, represent regions with a high probability of target occurrence. This new solution highlights areas where individual interpretations of different geophysical methods correlate, increasing confidence in the interpretation. We demonstrate the effectiveness of our MAM with application to real data from onshore and offshore Brazil. In the onshore Recôncavo Basin, the MAM allows the interpreter to identify a channel where a drilled well found the largest sandstone thickness in the area. In a second example, from the offshore Sergipe-Alagoas Basin, the MAM helps differentiate between a dry and an oil-bearing channel previously outlined in seismic data. These outcomes indicate that the MAM is a valid interpretation tool that we believe can be applied to a wide range of geologic problems.
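The logistic rescaling step can be sketched as follows: each data set is mapped onto (0, 1) so that responses of wildly different units become comparable before fusion. Centering on the mean and scaling by the standard deviation is an illustrative choice here, not necessarily the authors' exact design.

```python
import math

def logistic_anomaly(values, center=None, scale=None):
    """Map raw geophysical responses onto (0, 1) with a logistic
    function, clipping large magnitudes and stretching contrast around
    the chosen anomaly level. center/scale default to the sample mean
    and standard deviation (an assumption for this sketch)."""
    n = len(values)
    if center is None:
        center = sum(values) / n
    if scale is None:
        var = sum((v - center) ** 2 for v in values) / n
        scale = math.sqrt(var) or 1.0
    return [1.0 / (1.0 + math.exp(-(v - center) / scale)) for v in values]
```

After this transform, logistic anomalies from gravity, magnetic, and resistivity data share a common (0, 1) range and can be multiplied or averaged into a single fused map.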


2021 ◽  
Vol 14 (1) ◽  
pp. 86-100
Author(s):  
Aleksei A. Korneev ◽  
Anatoly N. Krichevets ◽  
Konstantin V. Sugonyaev ◽  
Dmitriy V. Ushakov ◽  
Alexander G. Vinogradov ◽  
...  

Background. Spearman's law of diminishing returns (SLODR) states that intercorrelations between scores on tests of intellectual abilities are higher when the data set comprises subjects with lower intellectual abilities, and vice versa. After almost a hundred years of research, this trend has only been detected on average. Objective. To determine whether the very different results obtained were due to variations in scaling and the selection of subjects. Design. We used three methods for SLODR detection based on moderated factor analysis (MFCA) to test real data and three sets of simulated data. Of the latter group, the first simulated a real SLODR effect. The second simulated the case of a different density of tasks of varying difficulty; it did not have a real SLODR effect. The third simulated a skewed selection of respondents with different abilities and also did not have a real SLODR effect. We selected the simulation parameters so that the correlation matrix of the simulated data was similar to the matrix created from the real data, and all distributions had similar skewness parameters (about -0.3). Results. The results of MFCA are contradictory, and we cannot clearly distinguish by this method the dataset with real SLODR from datasets with a similar correlation structure and skewness but without a real SLODR effect. The results allow us to conclude that when effects like SLODR are very subtle and can be identified only with a large sample, features of the psychometric scale become very important, because small variations in scale metrics may lead either to masking of real SLODR or to false identification of SLODR.


Tomography ◽  
2021 ◽  
Vol 7 (3) ◽  
pp. 358-372
Author(s):  
Matthew D. Holbrook ◽  
Darin P. Clark ◽  
Rutulkumar Patel ◽  
Yi Qi ◽  
Alex M. Bassil ◽  
...  

We are developing imaging methods for a co-clinical trial investigating synergy between immunotherapy and radiotherapy. We perform longitudinal micro-computed tomography (micro-CT) of mice to detect lung metastasis after treatment. This work explores deep learning (DL) as a fast approach for automated lung nodule detection. We used data from control mice both with and without primary lung tumors. To augment the number of training sets, we simulated data by inserting real, augmented tumors into micro-CT scans. We employed a convolutional neural network (CNN) trained with four competing types of training data: (1) simulated only, (2) real only, (3) simulated and real, and (4) pretraining on simulated data followed by real data. We evaluated model performance using precision and recall curves, as well as receiver operating characteristic (ROC) curves and their area under the curve (AUC). The AUC was almost identical (0.76-0.77) for all four cases. However, the combination of real and synthetic data improved precision by 8%. Smaller tumors have lower rates of detection than larger ones, with networks trained on real data showing better performance. Our work suggests that DL is a promising approach for fast and relatively accurate detection of lung tumors in mice.
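The precision and recall curves used above can be computed by sweeping a score threshold over the ranked detections. A minimal sketch (not the authors' evaluation code):

```python
def precision_recall_curve(labels, scores):
    """Precision and recall after each detection, sweeping the score
    threshold from the highest-scoring detection down. labels[i] is 1
    for a true nodule and 0 for a false detection."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    precision, recall = [], []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        precision.append(tp / (tp + fp))
        recall.append(tp / total_pos)
    return precision, recall
```

Reading the precision at a fixed recall level is how an 8% precision gain, as reported for the mixed real-plus-synthetic training set, would be measured.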


Author(s):  
Guro Dørum ◽  
Lars Snipen ◽  
Margrete Solheim ◽  
Solve Saebo

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation, and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identifying important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.
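Smoothing genewise test statistics over a network can be sketched as a weighted average in which the weight of a neighbour decays with its network distance. The exponential kernel and the `decay` parameter here are illustrative assumptions; the paper selects the degree of smoothing via a network-data correlation criterion, which this sketch omits.

```python
import math

def smooth_statistics(t_stats, distances, decay=1.0):
    """Replace each gene's test statistic with a weighted average of all
    statistics, weights decaying exponentially in the network distance
    distances[i][j] (distance 0 to itself gives weight 1)."""
    n = len(t_stats)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-decay * distances[i][j]) for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * t for w, t in zip(weights, t_stats)) / total)
    return smoothed
```

Genes sitting in a subnetwork dense with differentially expressed neighbours keep large smoothed statistics, while isolated spurious hits are pulled toward the background, which is the false-positive reduction the abstract describes.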


Author(s):  
J. DIEBOLT ◽  
M.-A. EL-AROUI ◽  
V. DURBEC ◽  
B. VILLAIN

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimations. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQs) in a nonparametric spirit, e.g., Pickands' excess method, methods based on Hill's estimate of the Pareto index, and the exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique is available for assessing and comparing these MEEQs when they are to be used on a given data set. This paper is a first attempt to provide such techniques. We first compare the estimations given by the main MEEQs on several simulated data sets. Then we suggest goodness-of-fit (GoF) tests to assess the MEEQs by measuring the quality of their underlying approximations. It is shown that GoF techniques provide very relevant tools for assessing and comparing the ET and excess methods. Other empirical criteria for comparing MEEQs are also proposed and studied through Monte Carlo analyses. Finally, these assessment and comparison techniques are tested on real data sets from an industrial context where extreme quantiles are needed to define maintenance policies.
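The exponential tail (ET) method mentioned above can be sketched as a peaks-over-threshold estimator: fit an exponential law to the k largest excesses over an order-statistic threshold and extrapolate. This is a minimal sketch of the general technique, not the paper's exact formulation.

```python
import math

def et_extreme_quantile(data, k, p):
    """Exponential-tail estimate of the quantile of order 1 - p.
    Threshold u is the (n-k)-th order statistic; the k excesses over u
    are modelled as exponential with mean equal to the sample mean
    excess, giving q = u + mean_excess * ln(k / (n * p))."""
    x = sorted(data)
    n = len(x)
    u = x[n - k - 1]                       # threshold
    excesses = [v - u for v in x[n - k:]]  # k largest observations
    mean_excess = sum(excesses) / k
    return u + mean_excess * math.log(k / (n * p))
```

For p smaller than k/n the estimate extrapolates beyond the sample maximum, which is exactly the regime where the paper's goodness-of-fit assessment of the underlying exponential approximation matters.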


2020 ◽  
Vol 45 (6) ◽  
pp. 719-749
Author(s):  
Eduardo Doval ◽  
Pedro Delicado

We propose new methods for identifying and classifying aberrant response patterns (ARPs) by means of functional data analysis. These methods take the person response function (PRF) of an individual and compare it with the pattern that would correspond to a generic individual of the same ability according to the item-person response surface; ARPs correspond to atypical difference functions. The ARP classification is done with functional data clustering applied to the PRFs identified as ARPs. We apply these methods to two sets of simulated data (the first is used to illustrate the ARP identification methods, and the second demonstrates classification of the response patterns flagged as ARPs) and a real data set (a Grade 12 science assessment test, SAT, with 32 items answered by 600 examinees). For comparative purposes, ARPs are also identified with three nonparametric person-fit indices (Ht, Modified Caution Index, and ZU3). Our results indicate that the ARP detection ability of one of our proposed methods is comparable to that of the person-fit indices. Moreover, the proposed classification methods enable ARPs associated with either spuriously low or spuriously high scores to be distinguished.


Author(s):  
Lianbo Yu ◽  
Parul Gulati ◽  
Soledad Fernandez ◽  
Michael Pennell ◽  
Lawrence Kirschner ◽  
...  

Gene expression microarray experiments with few replications lead to great variability in estimates of gene variances. Several Bayesian methods have been developed to reduce this variability and to increase power. Thus far, moderated t methods have assumed a constant coefficient of variation (CV) for the gene variances. We provide evidence against this assumption and extend the method by allowing the CV to vary with gene expression. Our CV-varying method, which we refer to as the fully moderated t-statistic, was compared to three other methods (the ordinary t and two moderated t predecessors). A simulation study and a familiar spike-in data set were used to assess the performance of the testing methods. The results showed that our CV-varying method had higher power than the other three methods, identified a greater number of true positives in spike-in data, fit simulated data under varying assumptions very well, and, in a real data set, better identified higher-expressing genes that were consistent with functional pathways associated with the experiments.
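The moderated t family that the abstract extends shrinks each genewise variance toward a prior value before forming the t-statistic. A minimal one-sample sketch in the empirical-Bayes style of limma (the fully moderated variant would additionally let s2_prior and d0 vary with expression level; the names here are this sketch's, not the paper's):

```python
import math

def moderated_t(mean_diff, s2, n, s2_prior, d0):
    """Moderated t-statistic: the genewise variance s2, carrying
    d = n - 1 degrees of freedom, is shrunk toward s2_prior, which
    carries d0 prior degrees of freedom, via a precision-weighted
    average. d0 = 0 recovers the ordinary t-statistic."""
    d = n - 1
    s2_post = (d0 * s2_prior + d * s2) / (d0 + d)
    return mean_diff / math.sqrt(s2_post / n)
```

With few replicates, the shrinkage stabilizes the denominator and prevents genes with accidentally tiny variances from dominating the top of the ranking, which is the power gain the abstract reports.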

