Robustness of classification and ordination techniques applied to macroinvertebrate communities from the La Trobe River, Victoria

1990 ◽  
Vol 41 (4) ◽  
pp. 493 ◽  
Author(s):  
R Marchant

The robustness of site groupings produced by ordination (DECORANA) and classification (TWINSPAN) techniques to variations in the quality of the raw data was investigated, using two data sets on macroinvertebrate communities from the La Trobe River. Ordinations or classifications based on the presence or absence of species were not substantially different from those based on actual abundance levels. However, when taxonomic discrimination was reduced from the species (or genus) level to the family level, distortions occurred in the resulting ordinations and classifications. In addition, ordinations based on 10 replicates per sample were little different from those based on a subset of 5 or 6 of these replicates; fewer than 4 replicates did not adequately represent the patterns present in the full data set.
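
The data-quality manipulations compared above (collapsing abundances to presence/absence, aggregating species to families, subsampling replicates) are straightforward to reproduce. Below is a minimal pandas sketch on a hypothetical site-by-species abundance table; all names are illustrative, and the DECORANA/TWINSPAN steps themselves are not shown.

```python
# Sketch of the three data reductions compared above, using a hypothetical
# site-by-species abundance table (all names illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
species = [f"sp{i}" for i in range(12)]
family = {s: f"fam{i // 3}" for i, s in enumerate(species)}  # species -> family lookup
abund = pd.DataFrame(rng.poisson(2.0, size=(8, 12)),
                     index=[f"site{i}" for i in range(8)], columns=species)

# Reduction 1: presence/absence instead of abundance
pres_abs = (abund > 0).astype(int)

# Reduction 2: family-level taxonomy (sum species within each family)
fam_abund = abund.T.groupby(family).sum().T

# Reduction 3: fewer replicates per sample, e.g. 5 of 10 chosen at random
kept = rng.choice(10, size=5, replace=False)

# Each reduced table would then be ordinated/classified (e.g. DECORANA,
# TWINSPAN) and the site groupings compared with those from the full data.
```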

2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables, yet data sets with mixed types of variables are common in data mining. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization of mixed (continuous/binary) data. The weights and prototypes are learned simultaneously, ensuring an optimized clustering: the higher the weight of a variable, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is thus combined with a weighting of the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show good quality of the topological ordering and homogeneous clustering.
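
The abstract does not give the update rules, so the sketch below only illustrates the core idea of a variable-weighted distance inside a standard SOM loop; the weight-update heuristic and all parameter values are assumptions, not the authors' formulation.

```python
# Minimal SOM with per-variable weights in the distance (sketch only; the
# weight-update heuristic below is an assumption, not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 6))                      # mixed data, rescaled to [0, 1]
n_units, n_features = 16, X.shape[1]
protos = rng.random((n_units, n_features))    # prototypes of a 4x4 map, flattened
w = np.ones(n_features) / n_features          # per-variable weights, sum to 1

grid = np.array([(i, j) for i in range(4) for j in range(4)])  # unit coordinates

for t, x in enumerate(X):
    # weighted squared distance: highly weighted variables count for more
    d = ((protos - x) ** 2 * w).sum(axis=1)
    bmu = d.argmin()

    # neighbourhood function on the 2-D grid, shrinking over time
    sigma = 2.0 * np.exp(-t / len(X))
    h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))

    # prototype update
    lr = 0.5 * np.exp(-t / len(X))
    protos += lr * h[:, None] * (x - protos)

    # assumed heuristic: upweight variables with low within-map variance
    spread = protos.var(axis=0) + 1e-9
    w = (1 / spread) / (1 / spread).sum()
```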


2017 ◽  
Vol 6 (3) ◽  
pp. 71 ◽  
Author(s):  
Claudio Parente ◽  
Massimiliano Pepe

The purpose of this paper is to investigate the impact of weights in pan-sharpening methods applied to satellite images. Different sets of weights have been considered and compared in the IHS and Brovey methods. The first data set assigns the same weight to each band, while the second uses the weights obtained from the spectral radiance response; these two data sets are the most common in pan-sharpening applications. The third data set results from a new method, which computes the first-order moment of inertia of each band, taking into account the spectral response. To test the impact of the weights of the different data sets, WorldView-3 satellite images have been considered. In particular, two different scenes (the first in an urban landscape, the second in a rural one) have been investigated. The quality of the pan-sharpened images has been analysed using three quality indices: root mean square error (RMSE), relative average spectral error (RASE) and Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS).
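
As a concrete illustration, a weighted Brovey transform and the ERGAS index can be sketched in a few lines of numpy. Array shapes, the resampling step and the random stand-in imagery are assumptions; the paper's exact processing chain is not reproduced here.

```python
# Sketch: weighted Brovey pan-sharpening and the ERGAS quality index.
# ms: multispectral bands already resampled to the pan grid, shape (B, H, W);
# pan: panchromatic band, shape (H, W); w: per-band weights summing to 1.
import numpy as np

def brovey(ms, pan, w):
    intensity = np.tensordot(w, ms, axes=1)        # weighted sum of bands
    return ms * (pan / (intensity + 1e-12))        # ratio-based injection

def ergas(fused, reference, ratio):
    # ratio = pan GSD / MS GSD (about 1/4 for WorldView-3)
    ref = reference.reshape(reference.shape[0], -1)
    fus = fused.reshape(fused.shape[0], -1)
    rmse = np.sqrt(((fus - ref) ** 2).mean(axis=1))
    mean = ref.mean(axis=1)
    return 100.0 * ratio * np.sqrt(((rmse / mean) ** 2).mean())

# Example with random stand-ins for real imagery:
rng = np.random.default_rng(0)
ms = rng.random((4, 64, 64)); pan = rng.random((64, 64))
w = np.full(4, 0.25)                               # equal weights (first data set)
fused = brovey(ms, pan, w)
print(ergas(fused, ms, ratio=1 / 4))
```

The equal-weight vector stands in for the first data set; swapping in spectral-response-derived weights reproduces the second.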


2005 ◽  
Vol 5 (7) ◽  
pp. 1835-1841 ◽  
Author(s):  
S. Noël ◽  
M. Buchwitz ◽  
H. Bovensmann ◽  
J. P. Burrows

Abstract. A first validation of water vapour total column amounts derived from measurements of the SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY (SCIAMACHY) in the visible spectral region has been performed. For this purpose, SCIAMACHY water vapour data have been determined for the year 2003 using an extended version of the Differential Optical Absorption Spectroscopy (DOAS) method, called Air Mass Corrected DOAS (AMC-DOAS). The SCIAMACHY results are compared with corresponding water vapour measurements by the Special Sensor Microwave Imager (SSM/I) and with model data from the European Centre for Medium-Range Weather Forecasts (ECMWF). Confirming previous results, SCIAMACHY-derived water vapour columns are typically slightly lower than both the SSM/I and ECMWF data, especially over ocean areas. However, these deviations are much smaller than the observed scatter of the data, which is caused by the different temporal and spatial sampling and resolution of the data sets. For example, the overall difference with ECMWF data is only -0.05 g/cm², whereas the typical scatter is on the order of 0.5 g/cm². Both values show almost no variation over the year. In addition, first monthly means of SCIAMACHY water vapour data have been computed. The quality of these monthly means is currently limited by the availability of calibrated SCIAMACHY spectra. Nevertheless, first comparisons with ECMWF data show that SCIAMACHY (and similar instruments) are able to provide a new, independent global water vapour data set.
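
The headline numbers above (a mean difference near -0.05 g/cm² against a scatter of about 0.5 g/cm²) are the bias and standard deviation of collocated differences. A minimal sketch, assuming two already-collocated column arrays (the collocation step itself is omitted):

```python
# Sketch: bias and scatter between two collocated water vapour data sets.
# scia and ecmwf are hypothetical arrays of total columns in g/cm^2.
import numpy as np

rng = np.random.default_rng(0)
ecmwf = 2.0 + 1.0 * rng.random(10_000)
scia = ecmwf - 0.05 + 0.5 * rng.standard_normal(10_000)

diff = scia - ecmwf
print(f"bias:    {diff.mean():+.2f} g/cm^2")   # systematic offset (~ -0.05)
print(f"scatter: {diff.std():.2f} g/cm^2")     # random spread (~ 0.5)
```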


2011 ◽  
Vol 77 (19) ◽  
pp. 7000-7006 ◽  
Author(s):  
Nicola M. Reid ◽  
Sarah L. Addison ◽  
Lucy J. Macdonald ◽  
Gareth Lloyd-Jones

ABSTRACT Huhu grubs (Prionoplus reticularis) are wood-feeding beetle larvae endemic to New Zealand and belonging to the family Cerambycidae. Compared to the wood-feeding lower termites, very little is known about the diversity and activity of microorganisms associated with xylophagous cerambycid larvae. To address this, we used pyrosequencing to evaluate the diversity of metabolically active and inactive bacteria in the huhu larval gut. Our estimate, that the gut harbors at least 1,800 phylotypes, is based on 33,420 sequences amplified from genomic DNA and reverse-transcribed RNA. Analysis of genomic DNA- and RNA-derived data sets revealed that 71% of all phylotypes (representing 95% of all sequences) were metabolically active. Rare phylotypes contributed considerably to the richness of the community and were also largely metabolically active, indicating their participation in digestive processes in the gut. The dominant families in the active community (RNA data set) included Acidobacteriaceae (24.3%), Xanthomonadaceae (16.7%), Acetobacteraceae (15.8%), Burkholderiaceae (8.7%), and Enterobacteriaceae (4.1%). The most abundant phylotype comprised 14% of the active community and affiliated with Dyella ginsengisoli (Gammaproteobacteria), suggesting that a Dyella-related organism is a likely symbiont. This study provides new information on the diversity and activity of gut-associated microorganisms that are essential for the digestion of the nutritionally poor diet consumed by wood-feeding larvae. Many huhu gut phylotypes affiliated with insect symbionts or with bacteria present in acidic environments or associated with fungi.


Endocrinology ◽  
2019 ◽  
Vol 160 (10) ◽  
pp. 2395-2400 ◽  
Author(s):  
David J Handelsman ◽  
Lam P Ly

Abstract. Hormone assay results below the assay detection limit (DL) can introduce bias into quantitative analysis. Although complex maximum likelihood estimation methods exist, they are not widely used; instead, simple substitution methods are often applied ad hoc to replace undetectable (UD) results with numeric values, so that the full data set can be analysed. However, the bias of substitution methods for steroid measurements has not been reported. Using a large data set (n = 2896) of serum testosterone (T), DHT, and estradiol (E2) concentrations from healthy men, we created modified data sets with increasing proportions of UD samples (≤40%), to which we applied five different substitution methods (deleting UD samples as missing, or substituting UD samples with DL, DL/√2, DL/2, or 0) to calculate univariate descriptive statistics (mean, SD) or bivariate correlations. For all three steroids, and for univariate as well as bivariate statistics, bias increased progressively with an increasing proportion of UD samples. Bias was worst when UD samples were deleted or substituted with 0 and least when UD samples were substituted with DL/√2, whereas the other methods (DL or DL/2) displayed intermediate bias. Similar findings were replicated in randomly drawn small subsets of 25, 50, and 100 samples. Hence, we propose that in steroid hormone data with ≤40% UD samples, substituting UD with DL/√2 is a simple, versatile, and reasonably accurate method to minimize left-censoring bias, allowing data analysis with the full data set.
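
The five substitution strategies compared above are simple to express in code. A sketch on a single hypothetical vector of concentrations with a known detection limit (the distribution and DL are assumptions, not the study's data):

```python
# Sketch: left-censoring substitution methods compared in the study.
# conc is a hypothetical vector of hormone concentrations; values below
# the detection limit (DL) are undetectable (UD) and must be handled.
import numpy as np

rng = np.random.default_rng(0)
conc = rng.lognormal(mean=1.0, sigma=0.6, size=2896)
DL = np.quantile(conc, 0.20)          # DL chosen so ~20% of samples are UD
detected = conc >= DL

def substitute(conc, detected, dl, method):
    out = conc.copy()
    if method == "delete":            # drop UD samples as missing
        return out[detected]
    fill = {"DL": dl, "DL/sqrt2": dl / np.sqrt(2), "DL/2": dl / 2, "zero": 0.0}
    out[~detected] = fill[method]
    return out

true_mean = conc.mean()
for m in ["delete", "DL", "DL/sqrt2", "DL/2", "zero"]:
    est = substitute(conc, detected, DL, m).mean()
    print(f"{m:>8}: mean = {est:6.3f} (bias {est - true_mean:+.3f})")
```

In the study, DL/√2 gave the least bias and deletion or zero-substitution the most; a toy comparison of this kind shows the same ordering.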


2020 ◽  
Author(s):  
Oleg Skrynyk ◽  
Enric Aguilar ◽  
José A. Guijarro ◽  
Sergiy Bubin

Before using climatological time series in research studies, it is necessary to perform quality control and homogenization in order to remove possible artefacts (inhomogeneities) usually present in the raw data sets. In the vast majority of cases, the homogenization procedure improves the consistency of the data, which can then be verified by statistical comparison of the raw and homogenized time series. However, a new question then arises: how far are the homogenized data from the true climate signal or, in other words, what errors could still be present in homogenized data?

The main objective of our work is to estimate the uncertainty produced by the adjustment algorithm of the widely used Climatol homogenization software when homogenizing daily time series of additive climate variables. We focused our efforts on minimum and maximum air temperature. To achieve our goal we used a benchmark data set created by the INDECIS* project. The benchmark contains clean data, extracted from an output of the Royal Netherlands Meteorological Institute Regional Atmospheric Climate Model (version 2) driven by the Hadley Global Environment Model 2 - Earth System, and inhomogeneous data, created by introducing realistic breaks and errors.

The statistical evaluation of discrepancies between the homogenized (by means of Climatol with predefined break points) and clean data sets was performed using both a set of standard parameters and metrics introduced in our work. All metrics used clearly identify the main features of the errors (systematic and random) present in the homogenized time series. We calculated the metrics for every time series (only over adjusted segments) as well as their averaged values as measures of uncertainty in the whole data set.

To determine how two key parameters of the raw data collection, namely the length of the time series and the station density, influence the calculated measures of the adjustment error, we gradually decreased the length of the period and the number of stations in the area under study. The total number of cases considered was 56, comprising 7 time periods (1950-2005, 1954-2005, …, 1974-2005) and 8 different numbers of stations (100, 90, …, 30). Additionally, to find out how stable the calculated metrics are for each of the 56 cases and to determine their confidence intervals, we performed 100 random permutations of the introduced inhomogeneities and repeated our calculations. The total number of homogenization exercises performed was thus 5600 for each of the two climate variables.

Lastly, the calculated metrics were compared with the corresponding values obtained for the raw time series. The comparison showed substantial improvement of the metric values after homogenization in each of the 56 cases considered (for both variables).

-------------------

*INDECIS is a part of ERA4CS, an ERA-NET initiated by JPI Climate, and funded by FORMAS (SE), DLR (DE), BMWFW (AT), IFD (DK), MINECO (ES), ANR (FR) with co-funding by the European Union (Grant 690462). The work has been partially supported by the Ministry of Education and Science of Kazakhstan (Grant BR05236454) and Nazarbayev University (Grant 090118FD5345).
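
The abstract does not spell out its metrics, so the sketch below uses the standard systematic/random decomposition (bias and error standard deviation over adjusted segments) as a stand-in; all data and names are assumed.

```python
# Sketch: systematic vs. random error of a homogenized series relative to
# a clean benchmark, evaluated only over adjusted segments. The study's
# specific metrics are not given in the abstract; bias and error standard
# deviation are shown as standard stand-ins.
import numpy as np

def adjustment_error(homog, clean, adjusted_mask):
    err = homog[adjusted_mask] - clean[adjusted_mask]
    return err.mean(), err.std()      # systematic (bias), random (spread)

rng = np.random.default_rng(0)
clean = 10 + 8 * np.sin(np.linspace(0, 20 * np.pi, 20_000))    # daily Tmax-like
homog = clean + 0.3 + 0.5 * rng.standard_normal(clean.size)   # residual errors
mask = np.zeros(clean.size, dtype=bool); mask[5_000:] = True  # adjusted segment

bias, spread = adjustment_error(homog, clean, mask)
print(f"systematic: {bias:+.2f} °C, random: {spread:.2f} °C")
```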


Author(s):  
MUSTAPHA LEBBAH ◽  
YOUNÈS BENNANI ◽  
NICOLETA ROGOVSCHI

This paper introduces a probabilistic self-organizing map for topographic clustering, analysis and visualization of multivariate binary data, or of categorical data using binary coding. We propose a probabilistic formalism dedicated to binary data in which cells are represented by a Bernoulli distribution. Each cell is characterized by a prototype with the same binary coding as used in the data space, together with the probability of being different from this prototype. The proposed learning algorithm, a Bernoulli self-organizing map, is an application of the standard EM algorithm. We illustrate the power of this method with six data sets taken from a public data set repository. The results show good quality of the topological ordering and homogeneous clustering.
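
The cell model described above (a binary prototype plus a probability of differing from it) has a simple likelihood. A sketch of that component only, with the EM and neighbourhood machinery omitted and all names assumed:

```python
# Sketch: likelihood of a binary observation under one map cell of the
# Bernoulli model described above: a binary prototype w plus a probability
# eps of a bit differing from it (EM and topology terms omitted).
import numpy as np

def cell_log_likelihood(x, w, eps):
    # x, w: binary vectors of equal length; eps: mismatch probability
    mismatches = np.count_nonzero(x != w)
    matches = x.size - mismatches
    return mismatches * np.log(eps) + matches * np.log(1 - eps)

x = np.array([1, 0, 1, 1, 0, 0])
w = np.array([1, 0, 0, 1, 0, 1])           # prototype lives in the data space
print(cell_log_likelihood(x, w, eps=0.1))  # higher for closer prototypes

# Given hard assignments, the M-step estimates would be the majority bit
# per feature (new prototype) and the observed mismatch rate (new eps).
```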


2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

Abstract. The quality of data is an important issue in data mining. The validity of mining algorithms is reduced if the data are not of good quality. Data quality can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied in the MV literature, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel and colleagues groups objects according to their density in spatial databases. The high-density regions are known as clusters, and the low-density regions are the noise objects in the data set. Extensive experiments have been performed on the Iris data set from the life-science domain and on Jain's (2D) data set from the shape data sets. The performance of the proposed method is evaluated using the root mean square error (RMSE) and compared with the existing K-means imputation (KMI). Results show that our method is more noise resistant than KMI on the data sets under study.
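
The abstract does not detail the imputation step itself. Below is a plausible minimal sketch in the same spirit, using scikit-learn's DBSCAN and filling each missing entry from the nearest cluster's mean; the parameter values and the fill rule are illustrative assumptions, not the authors' exact DBSCANI procedure.

```python
# Sketch of density-based imputation in the spirit of DBSCANI: cluster the
# complete records with DBSCAN, then fill each missing entry with the mean
# of that feature in the nearest cluster. Parameters are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
truth = rng.random((200, 4))                     # ground truth, for evaluation
X = truth.copy()
X[rng.random(X.shape) < 0.05] = np.nan           # inject ~5% missing values

complete = ~np.isnan(X).any(axis=1)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X[complete])
centroids = np.array([X[complete][labels == k].mean(axis=0)
                      for k in range(labels.max() + 1)])  # noise (-1) excluded

X_imp = X.copy()
for i in np.where(~complete)[0]:
    miss = np.isnan(X[i])
    # nearest cluster centroid, measured over the observed features only
    d = ((centroids[:, ~miss] - X[i, ~miss]) ** 2).sum(axis=1)
    X_imp[i, miss] = centroids[d.argmin(), miss]

print("RMSE:", np.sqrt(((X_imp - truth)[np.isnan(X)] ** 2).mean()))
```

Because DBSCAN labels low-density points as noise (-1), they never contribute to the centroids, which is what makes the imputation noise resistant relative to KMI.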


2016 ◽  
Author(s):  
Brecht Martens ◽  
Diego G. Miralles ◽  
Hans Lievens ◽  
Robin van der Schalie ◽  
Richard A. M. de Jeu ◽  
...  

Abstract. The Global Land Evaporation Amsterdam Model (GLEAM) is a set of algorithms dedicated to the estimation of terrestrial evaporation and root-zone soil moisture from satellite data. Ever since its development in 2011, the model has been regularly revised, aiming at the optimal incorporation of new satellite-observed geophysical variables and an improved representation of physical processes. In this study, the next version of this model (v3) is presented. Key changes relative to the previous version include: (1) a revised formulation of the evaporative stress, (2) an optimized drainage algorithm, and (3) a new soil moisture data assimilation system. GLEAM v3 is used to produce three new data sets of terrestrial evaporation and root-zone soil moisture, including a 35-year data set spanning the period 1980–2014 (v3.0a, based on satellite-observed soil moisture, vegetation optical depth and snow water equivalents, reanalysis air temperature and radiation, and a multi-source precipitation product), and two fully satellite-based data sets. The latter two share most of their forcing, except for the vegetation optical depth and soil moisture products, which are based on observations from different passive and active C- and L-band microwave sensors (European Space Agency Climate Change Initiative data sets) for the first data set (v3.0b, spanning the period 2003–2015) and observations from the Soil Moisture and Ocean Salinity satellite in the second data set (v3.0c, spanning the period 2011–2015). These three data sets are described in detail, compared against analogous data sets generated using the previous version of GLEAM (v2), and validated against measurements from 64 eddy-covariance towers and 2338 soil moisture sensors across a broad range of ecosystems. Results indicate that the quality of the v3 soil moisture is consistently better than that of v2: average correlations against in situ surface soil moisture measurements increase from 0.61 to 0.64 in the case of the v3.0a data set, and the representation of soil moisture in the second layer improves as well, with correlations increasing from 0.47 to 0.53. Similar improvements are observed for the two fully satellite-based data sets. Despite regional differences, the quality of the evaporation fluxes remains overall similar to that obtained using the previous version of GLEAM, with average correlations against eddy-covariance measurements between 0.78 and 0.80 for the three different data sets. These global data sets of terrestrial evaporation and root-zone soil moisture are now openly available at http://GLEAM.eu and may be used for large-scale hydrological applications, climate studies and research on land-atmosphere feedbacks.
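
The validation numbers quoted above are averages of per-site correlations. A minimal sketch of that aggregation, with random stand-ins for the collocated model and in situ series at each site:

```python
# Sketch: average per-site correlation between modelled and in situ soil
# moisture, the aggregate statistic quoted above. Data are stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_days = 50, 365

corrs = []
for _ in range(n_sites):
    insitu = rng.random(n_days)
    model = insitu + 0.4 * rng.standard_normal(n_days)  # imperfect model
    corrs.append(np.corrcoef(model, insitu)[0, 1])

print("average correlation:", np.mean(corrs))
```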


2019 ◽  
Author(s):  
Jacob Schreiber ◽  
Jeffrey Bilmes ◽  
William Stafford Noble

Abstract. Motivation: Recent efforts to describe the human epigenome have yielded thousands of uniformly processed epigenomic and transcriptomic data sets. These data sets characterize a rich variety of biological activity in hundreds of human cell lines and tissues ("biosamples"). Understanding these data sets, and specifically how they differ across biosamples, can help explain many cellular mechanisms, particularly those driving development and disease. However, due primarily to cost, the total number of assays that can be performed is limited. Previously described imputation approaches, such as Avocado, have sought to overcome this limitation by predicting genome-wide epigenomics experiments using learned associations among available epigenomic data sets. However, these previous imputations have focused primarily on measurements of histone modification and chromatin accessibility, despite other biological activity being crucially important. Results: We applied Avocado to a data set of 3,814 tracks of data derived from the ENCODE compendium, spanning 400 human biosamples and 84 assays. The resulting imputations cover measurements of chromatin accessibility, histone modification, transcription, and protein binding. We demonstrate the quality of these imputations by comprehensively evaluating the model's predictions and by showing significant improvements in protein binding performance compared to the top models in an ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model, achieving high accuracy at predicting protein binding, even with only a single track of training data. Availability: Tutorials and source code are available under an Apache 2.0 license at https://github.com/jmschrei/avocado. Contact: [email protected] or [email protected]
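
As a small illustration of the kind of track-level evaluation described above, an imputed signal track can be compared against the observed one with MSE and Pearson correlation. This is a generic sketch with stand-in arrays, not the paper's protocol, and it does not assume the avocado package's API; arcsinh-transforming signal before comparison is a common convention in this setting.

```python
# Sketch: comparing an imputed signal track against the observed one, the
# generic form of the evaluation described above (data are stand-ins).
import numpy as np

def evaluate_track(imputed, observed):
    mse = ((imputed - observed) ** 2).mean()
    r = np.corrcoef(imputed, observed)[0, 1]
    return mse, r

rng = np.random.default_rng(0)
observed = rng.gamma(shape=0.5, scale=2.0, size=100_000)   # signal-like values
imputed = observed + rng.standard_normal(observed.size)    # hypothetical model

mse, r = evaluate_track(np.arcsinh(imputed), np.arcsinh(observed))
print(f"MSE = {mse:.3f}, Pearson r = {r:.3f}")
```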

