The Relative Importance of Modeling Site Pattern Heterogeneity Versus Partition-Wise Heterotachy in Phylogenomic Inference

2019 ◽  
Vol 68 (6) ◽  
pp. 1003-1019 ◽  
Author(s):  
Huai-Chun Wang ◽  
Edward Susko ◽  
Andrew J Roger

Abstract Large taxa-rich genome-scale data sets are often necessary for resolving ancient phylogenetic relationships, but accurate phylogenetic inference requires that they be analyzed with realistic models that account for the heterogeneity in substitution patterns among sites, genes, and lineages. Two kinds of adjustments are frequently used: models that account for heterogeneity in amino acid frequencies at sites in proteins, and partitioned models that accommodate heterogeneity in rates (branch lengths) among different proteins in different lineages (protein-wise heterotachy). Although partitioned and site-heterogeneous models are both widely used in isolation, their relative importance to the inference of correct phylogenies has not been carefully evaluated. We conducted several empirical analyses and a large set of simulations to compare the relative performance of partitioned models, site-heterogeneous models, and combined partitioned site-heterogeneous models. In general, site-homogeneous models (partitioned or not) performed worse than site-heterogeneous models, except in simulations with extreme protein-wise heterotachy. Furthermore, simulations using empirically derived, realistic parameter settings showed a marked long-branch attraction (LBA) problem for analyses employing protein-wise partitioning, even when the generating model included partitioning. This LBA problem results from a small-sample bias compounded over many single-protein alignments. In some cases, this problem was ameliorated by clustering similarly evolving proteins together into larger partitions using the PartitionFinder method. Similar results were obtained under simulations with larger numbers of taxa or heterogeneity in simulating topologies over genes. For an empirical Microsporidia test data set, all but one of the tested site-heterogeneous models (with or without partitioning) recovered the correct Microsporidia+Fungi grouping, whereas site-homogeneous models (with or without partitioning) did not. The single exception was the fully partitioned site-heterogeneous analysis, which succumbed to the compounded small-sample LBA bias. In general, unless protein-wise heterotachy effects are extreme, it is more important to model site heterogeneity than protein-wise heterotachy in phylogenomic analyses. Complete protein-wise partitioning should be avoided, as it can lead to a serious LBA bias. In cases of extreme protein-wise heterotachy, approaches that cluster similarly evolving proteins together, coupled with site-heterogeneous models, work well for phylogenetic estimation.
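
The clustering remedy mentioned above can be illustrated with a small sketch. Assuming each protein has been summarised as a profile of relative rates across the branches of a fixed reference topology (a simplification; PartitionFinder actually searches partitioning schemes via information criteria), similarly evolving proteins can be grouped into larger partitions with k-means. All data below are simulated and the setup is hypothetical.

```python
# Hypothetical sketch: grouping similarly evolving proteins into larger
# partitions before phylogenetic analysis, in the spirit of the
# PartitionFinder-style clustering discussed above. The rate matrix is
# simulated; in practice it would come from per-protein branch-length
# estimates on a fixed reference topology.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_proteins, n_branches = 200, 37          # e.g. a 20-taxon unrooted tree
# Simulate three latent "heterotachy classes" of proteins.
class_rates = rng.gamma(shape=2.0, scale=1.0, size=(3, n_branches))
labels_true = rng.integers(0, 3, size=n_proteins)
rates = class_rates[labels_true] * rng.lognormal(0.0, 0.2, (n_proteins, n_branches))

# Normalise each protein's branch-rate profile so clustering reflects
# heterotachy (relative rate shifts), not overall protein rate.
profiles = rates / rates.mean(axis=1, keepdims=True)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)
for k in range(3):
    members = np.flatnonzero(km.labels_ == k)
    print(f"partition {k}: {len(members)} proteins")
# Each resulting cluster would then be analysed as one larger partition,
# avoiding the per-protein small-sample bias described in the abstract.
```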

2019 ◽  
Vol 34 (9) ◽  
pp. 1369-1383 ◽  
Author(s):  
Dirk Diederen ◽  
Ye Liu

Abstract With the ongoing development of distributed hydrological models, flood risk analysis calls for synthetic, gridded precipitation data sets. The availability of large, coherent, gridded re-analysis data sets, in combination with the increase in computational power, accommodates the development of new methodology to generate such synthetic data. We tracked moving precipitation fields and classified them using self-organising maps. For each class, we fitted a multivariate mixture model and generated a large set of synthetic, coherent descriptors, which we used to reconstruct moving synthetic precipitation fields. We introduced randomness by replacing the observed precipitation fields in the original data set with the synthetic ones. The output is a continuous, gridded, hourly precipitation data set of much longer duration, containing physically plausible and spatio-temporally coherent precipitation events. The proposed methodology implicitly provides an important improvement in the spatial coherence of precipitation extremes. We investigate the issue of unrealistic, sudden changes on the grid and demonstrate how a dynamic spatio-temporal generator can provide spatial smoothness in the probability distribution parameters and hence in the return level estimates.
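
The per-class generator at the heart of this method can be sketched compactly. The snippet below assumes each tracked precipitation field has already been reduced to a handful of descriptors and assigned a self-organising-map class; here the descriptors are simulated and the SOM labels are random stand-ins. It fits a multivariate Gaussian mixture per class and samples a much larger synthetic descriptor set, as the abstract describes.

```python
# A minimal sketch of the per-class generator, with hypothetical storm
# descriptors (e.g. duration, peak intensity, area, speed). The SOM
# classification step is stubbed out with random class labels.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
n_events, n_classes = 500, 4
classes = rng.integers(0, n_classes, n_events)        # stand-in for SOM labels
descriptors = rng.lognormal(mean=0.0, sigma=0.5, size=(n_events, 4))

synthetic = []
for c in range(n_classes):
    obs = descriptors[classes == c]
    # Fit a multivariate Gaussian mixture to this class's descriptors...
    gmm = GaussianMixture(n_components=2, random_state=0).fit(obs)
    # ...and sample a much larger set of synthetic, coherent descriptors.
    sample, _ = gmm.sample(10 * len(obs))
    synthetic.append(sample)
synthetic = np.vstack(synthetic)
print(synthetic.shape)   # descriptors used to reconstruct synthetic fields
```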


2020 ◽  
Author(s):  
Ying Zhao ◽  
Charles C. Zhou

SARS-CoV-2 is a novel and deadly virus that has caused a worldwide pandemic, drastic loss of human life, and severe disruption of economic activity. An open data set called the COVID-19 Open Research Dataset (CORD-19) contains a large set of full-text scientific literature on SARS-CoV-2. Next Strain maintains a database of SARS-CoV-2 viral genomes collected since 12/3/2019. We applied a unique information mining method named lexical link analysis (LLA) to answer the call to action and help the science community answer high-priority scientific questions related to SARS-CoV-2. We first text-mined CORD-19, then data-mined the Next Strain database, and finally linked the two databases. The linked databases and information can be used to discover insights and help the research community address high-priority questions related to SARS-CoV-2's genetics, tests, and prevention.

Significance Statement In this paper, we show how to apply a unique information mining method, lexical link analysis (LLA), to link unstructured (CORD-19) and structured (Next Strain) data sets to relevant publications, integrate text and data mining into a single platform to discover insights that can be visualized and validated, and answer high-priority questions on the genetics, incubation, treatment, symptoms, and prevention of COVID-19.
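
The linking idea can be illustrated with a toy sketch. Lexical link analysis associates word pairs that co-occur; the snippet below builds bigram links from hypothetical abstract text and matches them against metadata fields of made-up genome records. It illustrates term-level linking only and is not the authors' LLA tool (note it uses itertools.pairwise, which requires Python 3.10+).

```python
# Toy sketch of linking unstructured literature to structured genome
# records via shared word pairs. All texts and records are invented.
from collections import Counter
from itertools import pairwise   # Python 3.10+

abstracts = [
    "spike protein mutation increases transmission",
    "spike protein binding affinity and vaccine design",
]
genome_records = [
    {"strain": "hCoV-19/X", "annotation": "spike protein"},
    {"strain": "hCoV-19/Y", "annotation": "nucleocapsid"},
]

bigrams = Counter()
for text in abstracts:
    bigrams.update(pairwise(text.split()))

# Link genome records to literature via shared word pairs.
for rec in genome_records:
    pair = tuple(rec["annotation"].split())
    if bigrams[pair] > 0:
        print(rec["strain"], "<->", pair, "in", bigrams[pair], "abstracts")
```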


2013 ◽  
Vol 25 (6) ◽  
pp. 1548-1584 ◽  
Author(s):  
Sascha Klement ◽  
Silke Anders ◽  
Thomas Martinetz

By minimizing the zero-norm of the separating hyperplane, the support feature machine (SFM) finds the smallest subspace (the least number of features) of a data set such that within this subspace, two classes are linearly separable without error. This way, the dimensionality of the data is reduced more efficiently than with support vector–based feature selection, which can be shown both theoretically and empirically. In this letter, we first provide a new formulation of the previously introduced concept of the SFM. With this new formulation, classification of unbalanced and nonseparable data is straightforward, which allows using the SFM for feature selection and classification in a large variety of different scenarios. To illustrate how the SFM can be used to identify both the smallest subset of discriminative features and the total number of informative features in biological data sets, we apply repetitive feature selection based on the SFM to a functional magnetic resonance imaging data set. We suggest that these capabilities qualify the SFM as a universal method for feature selection, especially for high-dimensional, small-sample-size data sets that often occur in biological and medical applications.
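
The zero-norm objective of the SFM is combinatorial, so the sketch below substitutes the standard l1 convex relaxation (a sparse linear SVM) to show the kind of feature selection involved; it is a stand-in, not the authors' SFM algorithm. The data are simulated with a small-sample, high-dimensional shape typical of the applications mentioned.

```python
# Sparse linear SVM (l1 relaxation of the zero-norm) as a stand-in for
# SFM-style feature selection on small-n, large-p data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=500, n_informative=5,
                           n_redundant=0, random_state=0)
clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000).fit(X, y)

selected = np.flatnonzero(clf.coef_)       # features with nonzero weight
print(f"{selected.size} of {X.shape[1]} features retained:", selected)
```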


2008 ◽  
Vol 08 (04) ◽  
pp. 495-512 ◽  
Author(s):  
PIETRO COLI ◽  
GIAN LUCA MARCIALIS ◽  
FABIO ROLI

The automatic vitality detection of a fingerprint has become an important issue in personal verification systems based on this biometric. It has been shown that fake fingerprints made using materials like gelatine or silicone can deceive commonly used sensors. Recently, the extraction of vitality features from fingerprint images has been proposed to address this problem. Among others, static and dynamic features have so far been studied separately, so their respective merits are not yet clear, especially because reported results were often obtained with different sensors and on small data sets, which could have obscured relative merits due to potential small-sample-size issues. In this paper, we compare several static and dynamic features by experiments on a larger data set, using the same optical sensor for the extraction of both feature sets. We dealt with fingerprint stamps made using liquid silicone rubber. Reported results show the relative merits of static and dynamic features and the performance improvement achievable by using such features together.
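
The static/dynamic split can be made concrete with a toy sketch: static features come from a single fingerprint image, dynamic features from a short capture sequence (for example, perspiration-driven changes between frames). The feature definitions below are simplified stand-ins, not the paper's actual measures, and the image data are random.

```python
# Simplified illustration of static vs. dynamic vitality features and
# their combination into one feature vector.
import numpy as np

rng = np.random.default_rng(1)
frames = rng.random((5, 128, 128))        # hypothetical capture sequence

def static_features(img):
    # e.g. grey-level statistics of a single frame
    return np.array([img.mean(), img.std()])

def dynamic_features(seq):
    # e.g. mean absolute change between consecutive frames
    diffs = np.abs(np.diff(seq, axis=0))
    return np.array([diffs.mean(), diffs.std()])

combined = np.concatenate([static_features(frames[0]), dynamic_features(frames)])
print(combined)   # joint feature vector, as in the combined analysis above
```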


Blood ◽  
2009 ◽  
Vol 114 (22) ◽  
pp. 2613-2613
Author(s):  
Lars Bullinger ◽  
Thomas Hielscher ◽  
Klaus H. Metzeler ◽  
Ursula Botzenhardt ◽  
Sabrina Heinrich ◽  
...  

Abstract 2613, Poster Board II-589. Cytogenetically normal acute myeloid leukemia (CN-AML) represents a biologically and clinically heterogeneous group. During recent years, novel molecular markers such as FLT3, CEBPA, and NPM1 gene mutations, as well as deregulated expression of single genes such as EVI1, MN1, ERG, and BAALC, have been identified that provide important prognostic information in CN-AML. Furthermore, DNA microarray-based gene expression profiling (GEP) has been shown to capture the molecular heterogeneity of leukemia and has been applied to build clinical outcome predictors in CN-AML. In this study, we wanted to assess whether GEP-based outcome prediction using novel biostatistical approaches applied to large gene expression data sets could refine previous findings. First, we profiled gene expression in a large set of clinically well-annotated CN-AML cases (entered on a multicenter trial for patients <60 years, AMLSG 07-04, n=154 cases) using Affymetrix Human Genome U133plus2.0 microarrays. Then, we applied L1-penalized Cox proportional hazards regression to develop a sparse prognostic model for overall survival. Using this algorithm, we were able to define a gene expression signature correlated with outcome in CN-AML. Interestingly, our model resulted in a signature comprising only one probe set, corresponding to the HBG1 gene (hemoglobin, gamma A). Quantitative RT-PCR validated correct measurement of HBG1 expression (correlation r=0.96, n=15), and we next evaluated our finding in independent cohorts of CN-AML cases. We applied our “HBG1 gene signature” to previously published CN-AML data (Metzeler et al. Blood 2008) comprising 163 patient samples on Affymetrix U133A/B (data set I) and 79 patient samples on Affymetrix U133plus2.0 microarrays (data set II). Univariate Cox PH regression showed that our gene expression predictor was highly significant for overall survival in both data sets (P=0.002, HR 1.71, 95%CI 1.23–2.39; and P<0.001, HR 3.03, 95%CI 1.84–5.02, respectively). Adjusted for age, NPM1, and FLT3-ITD mutational status, HBG1 retained its prognostic relevance in multivariate analysis (P=0.007 and P<0.001, respectively). Although HBG1 expression might be used as a novel prognostic marker in AML, its role in AML remains to be determined. HBG1 expression might represent a surrogate marker indicating the activation of distinct oncoproteins such as MYB, which has recently been shown to transcriptionally repress gamma-globin. On the other hand, HBG1 expression has been shown to be induced upon exposure to azacytidine; thus, HBG1 expression could also reflect underlying epigenetic profiles of prognostic relevance. While studying HBG1 expression in larger CN-AML cohorts as well as in the context of other molecular markers might help to further determine the prognostic impact of HBG1, integrative analyses combining GEP with genomics and epigenomics data sets will provide novel insights into the mechanisms underlying HBG1 expression in CN-AML. Disclosures: No relevant conflicts of interest to declare.
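
The modelling step described above lends itself to a compact sketch: an L1-penalised Cox proportional hazards regression on expression data drives most coefficients to zero and can yield a very sparse signature, ideally (as in the abstract) a single probe set. The sketch below uses simulated data and the lifelines library; the probe-set column names and all parameter values are hypothetical, and the authors' actual pipeline is not reproduced.

```python
# A minimal sketch of L1-penalised Cox regression for a sparse
# expression-based survival signature, on simulated data.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, p = 154, 50
X = pd.DataFrame(rng.normal(size=(n, p)),
                 columns=[f"probe_{i}" for i in range(p)])
# Survival depends on a single probe set, mimicking a sparse signal.
hazard = np.exp(0.8 * X["probe_0"])
X["os_months"] = rng.exponential(24.0 / hazard)
X["event"] = rng.random(n) < 0.7        # ~70% observed events

# Pure L1 penalty (l1_ratio=1.0) shrinks most coefficients toward zero.
cph = CoxPHFitter(penalizer=0.5, l1_ratio=1.0)
cph.fit(X, duration_col="os_months", event_col="event")
print(cph.params_.abs().sort_values(ascending=False).head())
# probe_0 should dominate; the rest are shrunk (approximately) to zero.
```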


2017 ◽  
Author(s):  
Joseph F. Walker ◽  
Ya Yang ◽  
Michael J. Moore ◽  
Jessica Mikenas ◽  
Alfonso Timoneda ◽  
...  

Abstract The carnivorous members of the large, hyperdiverse Caryophyllales (e.g. Venus flytrap, sundews, and Nepenthes pitcher plants) represent perhaps the oldest and most diverse lineage of carnivorous plants. However, despite numerous studies seeking to elucidate their evolutionary relationships, the early-diverging relationships remain unresolved. To explore the utility of phylogenomic data sets for resolving relationships among the carnivorous Caryophyllales, we sequenced ten transcriptomes, including all the carnivorous genera except those in the rare West African liana family Dioncophyllaceae. We used a variety of methods to infer the species tree, examine gene tree conflict, and infer paleopolyploidy events. Phylogenomic analyses support the monophyly of the carnivorous Caryophyllales, with an origin of 68-83 mya. In contrast to previous analyses, we recover the remaining non-core Caryophyllales as non-monophyletic, although there are multiple reasons this result may be spurious, and the node supporting this relationship contains a significant amount of gene tree discordance. We present evidence that the clade contains at least seven independent paleopolyploidy events, that previously debated nodes from the literature have high levels of gene tree conflict, and that taxon sampling influences topology even in a phylogenomic data set. Our data demonstrate the importance of carefully considering gene tree conflict and taxon sampling in phylogenomic analyses. Moreover, they provide a remarkable example of the propensity for paleopolyploidy in angiosperms, with at least seven such events in a clade of fewer than 2500 species.
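
The gene-tree-conflict bookkeeping behind such analyses can be sketched in a few lines. For a focal clade, each gene tree is scored as concordant (it contains the clade), conflicting (it contains an overlapping but incompatible clade), or uninformative. The clades below are given directly as sets with illustrative taxon names; a real analysis would extract them from newick gene trees.

```python
# Toy gene-tree conflict counting for one focal clade. Two (rooted)
# clades are compatible iff they are disjoint or nested; an overlapping,
# non-nested clade conflicts with the focal clade.
from collections import Counter

focal = frozenset({"Drosera", "Dionaea", "Aldrovanda"})

gene_tree_clades = [
    [frozenset({"Drosera", "Dionaea", "Aldrovanda"}),
     frozenset({"Nepenthes", "Drosophyllum"})],
    [frozenset({"Drosera", "Nepenthes"})],      # conflicts with focal
    [frozenset({"Beta", "Spinacia"})],          # uninformative
]

def status(clades, focal):
    if focal in clades:
        return "concordant"
    for c in clades:
        if (c & focal) and not (c <= focal) and not (focal <= c):
            return "conflicting"
    return "uninformative"

print(Counter(status(clades, focal) for clades in gene_tree_clades))
```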


2021 ◽  
Vol 24 ◽  
pp. 45-52
Author(s):  
Jana Busa ◽  
Inese Polaka

The study focuses on the analysis of biological data containing information on the number of genome sequences of intestinal microbiome bacteria before and after antibiotic use. The data have high dimensionality (bacterial taxa) and a small number of records, which is typical of bioinformatics data. Classification models induced on data sets like this are usually not stable, and their accuracy metrics have high variance. The aim of the study is to create a preprocessing workflow and a classification model that can classify the microbiome into before- and after-antibiotic groups as accurately as possible and lessen the variability of the classifier's accuracy measures. To evaluate the accuracy of the model, the area under the ROC curve and the overall accuracy of the classifier were used. In the experiments, the authors examined how classification results were affected by feature selection and an increased size of the data set.
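
A workflow of the kind described can be sketched with standard tooling: univariate feature selection nested inside a cross-validated pipeline, with repeated cross-validation to expose the variance of the AUC on a small, high-dimensional data set. The data below are simulated stand-ins for bacterial taxa counts, and the parameter choices are illustrative.

```python
# Feature selection inside a cross-validated pipeline, with repeated CV
# to show how variable the AUC is on small, high-dimensional data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=40, n_features=300, n_informative=10,
                           random_state=0)          # few records, many taxa
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(pipe, X, y, scoring="roc_auc", cv=cv)
print(f"AUC {scores.mean():.2f} +/- {scores.std():.2f}")   # note the variance
```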


2020 ◽  
pp. 096228022097022
Author(s):  
Frank Konietschke ◽  
Karima Schwab ◽  
Markus Pauly

In many experiments, and especially in translational and preclinical research, sample sizes are (very) small. In addition, data designs are often high dimensional, i.e. there are more dependent than independent replications of the trial. The present paper discusses the applicability of max t-test-type statistics (multiple contrast tests) in high-dimensional designs (repeated measures or multivariate) with small sample sizes. A randomization-based approach is developed to approximate the distribution of the maximum statistic. Extensive simulation studies confirm that the new method is particularly suitable for analyzing data sets with small sample sizes. A real data set illustrates the application of the methods.
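
The randomization idea can be illustrated for the simplest case. For paired or one-sample high-dimensional data, random sign flips of whole observation vectors approximate the null distribution of the maximum absolute t-statistic while preserving the dependence between components. This is a sketch of the general idea only; the paper's procedure for multiple contrast tests is more general.

```python
# Sign-flipping randomization for a max-type statistic on small-n,
# high-dimensional data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                      # small sample, high dimension
X = rng.normal(size=(n, d))
X[:, 0] += 1.5                     # one truly shifted component

def max_t(data):
    t = data.mean(0) / (data.std(0, ddof=1) / np.sqrt(len(data)))
    return np.max(np.abs(t))

observed = max_t(X)
# Flip the sign of whole rows to keep the dependence structure intact.
null = np.array([max_t(X * rng.choice([-1, 1], size=(n, 1)))
                 for _ in range(2000)])
print("p-value:", (1 + np.sum(null >= observed)) / (1 + len(null)))
```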


2016 ◽  
Vol 35 (2) ◽  
pp. 173-190 ◽  
Author(s):  
S. Shahid Shaukat ◽  
Toqeer Ahmed Rao ◽  
Moazzam A. Khan

Abstract In this study, we used bootstrap simulation of a real data set to investigate the impact of sample size (N = 20, 30, 40, and 50) on the eigenvalues and eigenvectors resulting from principal component analysis (PCA). For each sample size, 100 bootstrap samples were drawn from an environmental data matrix pertaining to water quality variables (p = 22) of a small data set comprising 55 samples (stations from which water samples were collected). Because data sets in ecology and environmental sciences are invariably small, owing to the high cost of collection and analysis of samples, we restricted our study to relatively small sample sizes. We focused attention on comparison of the first 6 eigenvectors and the first 10 eigenvalues. Data sets were compared using agglomerative cluster analysis with Ward's method, which does not require any stringent distributional assumptions.
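
The bootstrap experiment translates almost directly into code: resample stations with replacement at each target sample size, re-run PCA, and inspect how the leading eigenvalues vary. The water-quality matrix below is simulated with the same shape as the study's (55 stations, 22 variables); the real analysis also compared eigenvectors and clustered the results.

```python
# Bootstrap PCA eigenvalues at several sample sizes, on simulated data
# shaped like the study's matrix (55 stations x 22 variables).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(55, 22))

def top_eigenvalues(X, k=10):
    cov = np.cov(X, rowvar=False)
    return np.sort(np.linalg.eigvalsh(cov))[::-1][:k]

for n in (20, 30, 40, 50):
    boots = np.array([top_eigenvalues(data[rng.integers(0, len(data), n)])
                      for _ in range(100)])
    print(f"N={n}: first eigenvalue {boots[:, 0].mean():.2f} "
          f"+/- {boots[:, 0].std():.2f}")
```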


2008 ◽  
Vol 34 (4) ◽  
pp. 487-511 ◽  
Author(s):  
James Henderson ◽  
Oliver Lemon ◽  
Kallirroi Georgila

We propose a method for learning dialogue management policies from a fixed data set. The method addresses the challenges posed by Information State Update (ISU)-based dialogue systems, which represent the state of a dialogue as a large set of features, resulting in a very large state space and a huge policy space. To address the problem that any fixed data set will only provide information about small portions of these state and policy spaces, we propose a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy to the portions of these spaces for which we have data. We also use linear function approximation to address the need to generalize from a fixed amount of data to large state spaces. To demonstrate the effectiveness of this method on this challenging task, we trained the model on the COMMUNICATOR corpus, to which we have added annotations for user actions and Information States. When tested with a user simulation trained on a different part of the same data set, our hybrid model outperforms a pure supervised learning model and a pure reinforcement learning model. It also outperforms the hand-crafted systems on the COMMUNICATOR data according to automatic evaluation measures, improving over the average COMMUNICATOR system policy by 10%. The proposed method will improve techniques for bootstrapping and automatic optimization of dialogue management policies from limited initial data sets.
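
The hybrid scheme can be caricatured in a short sketch: Q-learning with linear function approximation over state features, where the learned policy may only choose actions that the fixed corpus actually contains (a crude, global version of the supervised restriction, which in the paper is state-dependent). All features, rewards, and transitions below are toy stand-ins.

```python
# Batch Q-learning with linear function approximation, restricted to
# actions observed in a fixed corpus of dialogue transitions.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4
W = np.zeros((n_actions, n_features))            # linear Q-value weights

# Fixed corpus of (state_features, action, reward, next_state_features).
corpus = [(rng.random(n_features), rng.integers(n_actions),
           rng.random(), rng.random(n_features)) for _ in range(500)]
observed_actions = {a for _, a, _, _ in corpus}  # supervised support

alpha, gamma = 0.05, 0.95
for epoch in range(20):
    for s, a, r, s2 in corpus:
        # Restrict the max to actions supported by the data.
        q_next = max(W[b] @ s2 for b in observed_actions)
        td_error = r + gamma * q_next - W[a] @ s
        W[a] += alpha * td_error * s             # gradient step

def policy(s):
    return max(observed_actions, key=lambda a: W[a] @ s)

print("chosen action:", policy(rng.random(n_features)))
```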

