A model-based clustering method to detect infectious disease transmission outbreaks from sequence variation

2017 ◽  
Author(s):  
Rosemary M McCloskey ◽  
Art FY Poon

Abstract Clustering infections by genetic similarity is a popular technique for identifying potential outbreaks of infectious disease, in part because sequences are now routinely collected for clinical management of many infections. A diverse range of nonparametric clustering methods has been developed for this purpose. These methods are generally intuitive, rapid to compute, and scale readily to large data sets. However, we have found that nonparametric clustering methods can be biased towards identifying clusters of diagnosis (where individuals are sampled sooner post-infection) rather than the clusters of rapid transmission that are meant to be potential foci for public health efforts. We develop a fundamentally new approach to genetic clustering based on fitting a Markov-modulated Poisson process (MMPP), which represents the evolution of transmission rates along the tree relating different infections. We evaluated this model-based method alongside five nonparametric clustering methods using both simulated and actual HIV sequence data sets. For simulated clusters of rapid transmission, the MMPP clustering method obtained higher mean sensitivity (85%) and specificity (91%) than the nonparametric methods. When we applied these clustering methods to published HIV-1 sequences from a study cohort of men who have sex with men in Seattle, USA, we found that the MMPP method assigned about half as many individuals (46%) to clusters as the other methods, and that the MMPP clusters were more consistent with transmission outbreaks. This new approach to genetic clustering has significant implications for the application of pathogen sequence analysis to public health, where it is critical to robustly and accurately identify clusters for the most cost-effective deployment of resources.

2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The extracted subset is intended for building prediction models (in the form of approximated functional relationships) in place of the entire large data set. Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Because it has only a single parameter, the proposed clustering method is shown to be orders of magnitude more efficient to use than SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied to data extraction are identified, and the proposed method is shown to be a solution for them.


Author(s):  
Yasunori Endo ◽  
Tomoyuki Suzuki ◽  
Naohiko Kinoshita ◽  
Yukihiro Hamasuna ◽  
...  

The fuzzy non-metric model (FNM) is a representative non-hierarchical clustering method, which is very useful because the belongingness, or membership degree, of each datum to each cluster can be calculated directly from the dissimilarities between data, and the cluster centers are not used. However, the original FNM cannot handle data with uncertainty. In this study, we refer to the data with uncertainty as “uncertain data,” e.g., incomplete data or data that have errors. Previously, a method was proposed based on the concept of a tolerance vector for handling uncertain data, and some clustering methods were constructed according to this concept, e.g., fuzzy c-means for data with tolerance. These methods can handle uncertain data in the framework of optimization. Thus, in the present study, we apply the concept to FNM. First, we propose a new clustering algorithm based on FNM using the concept of tolerance, which we refer to as the fuzzy non-metric model for data with tolerance. Second, we show that the proposed algorithm can handle incomplete data sets. Third, we verify the effectiveness of the proposed algorithm based on comparisons with conventional methods for incomplete data sets in some numerical examples.


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Connor Chato ◽  
Marcia L Kalish ◽  
Art F Y Poon

Abstract Genetic clustering is a popular method for characterizing variation in transmission rates for rapidly evolving viruses, and could potentially be used to detect outbreaks in ‘near real time’. However, the statistical properties of clustering are poorly understood in this context, and there are no objective guidelines for setting clustering criteria. Here, we develop a new statistical framework to optimize a genetic clustering method based on the ability to forecast new cases. We analysed the pairwise Tamura-Nei (TN93) genetic distances for anonymized HIV-1 subtype B pol sequences from Seattle (n = 1,653) and Middle Tennessee, USA (n = 2,779), and northern Alberta, Canada (n = 809). Under varying TN93 thresholds, we fit two models to the distributions of new cases relative to clusters of known cases: 1, a null model that assumes cluster growth is strictly proportional to cluster size, i.e. no variation in transmission rates among individuals; and 2, a weighted model that incorporates individual-level covariates, such as recency of diagnosis. The optimal threshold maximizes the difference in information loss between models, where covariates are used most effectively. Optimal TN93 thresholds varied substantially between data sets, e.g. 0.0104 in Alberta and 0.016 in Seattle and Tennessee, such that the optimum for one population would potentially misdirect prevention efforts in another. For a given population, the range of thresholds where the weighted model conferred greater predictive accuracy tended to be narrow (±0.005 units), and the optimal threshold tended to be stable over time. Our framework also indicated that variation in the recency of HIV diagnosis among clusters was significantly more predictive of new cases than sample collection dates (ΔAIC > 50). 
These results suggest that one cannot rely on historical precedence or convention to configure genetic clustering methods for public health applications, especially when translating methods between settings of low-level and generalized epidemics. Our framework not only enables investigators to calibrate a clustering method to a specific public health setting, but also provides a variable selection procedure to evaluate different predictive models of cluster growth.
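The role of the distance threshold described above can be illustrated with a minimal sketch. It assumes clusters are formed as connected components of a graph that links any two sequences whose pairwise distance is at or below the threshold; the abstract does not spell out this rule, and the function name `threshold_clusters` and the toy distances are hypothetical, not data from the study.

```python
def threshold_clusters(distances, n, threshold):
    """Group items 0..n-1 into connected components, linking any pair
    whose distance is at or below the threshold (assumed rule)."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in distances.items():
        if d <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pairwise distances for five sequences (illustrative only).
toy = {
    (0, 1): 0.010, (0, 2): 0.030, (0, 3): 0.050, (0, 4): 0.060,
    (1, 2): 0.014, (1, 3): 0.045, (1, 4): 0.055,
    (2, 3): 0.040, (2, 4): 0.050,
    (3, 4): 0.012,
}

for cutoff in (0.0104, 0.016):
    print(cutoff, threshold_clusters(toy, 5, cutoff))
```

On this toy input, the lower cutoff links only one pair, while the higher cutoff merges the sequences into two clusters, which shows how a threshold tuned for one population can produce a different partition in another.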


2018 ◽  
Vol 56 (11) ◽  
pp. 1970-1978 ◽  
Author(s):  
Wayne Dimech ◽  
Marina Karakaltsas ◽  
Giuseppe A. Vincini

Abstract Background: A general trend towards conducting infectious disease serology testing in centralized laboratories means that quality control (QC) principles used for clinical chemistry testing are applied to infectious disease testing. However, no systematic assessment of methods used to establish QC limits has been applied to infectious disease serology testing. Methods: A total of 103 QC data sets, obtained from six different infectious disease serology analytes, were parsed through standard methods for establishing statistical control limits, including guidelines from Public Health England, USA Clinical and Laboratory Standards Institute (CLSI), German Richtlinien der Bundesärztekammer (RiliBÄK) and Australian QConnect. The percentage of QC results failing each method was compared. Results: The percentage of data sets having more than 20% of QC results failing Westgard rules when the first 20 results were used to calculate the mean±2 standard deviation (SD) ranged from 3 (2.9%) for R4S to 66 (64.1%) for 10X rule, whereas the percentage ranged from 0 (0%) for R4S to 32 (40.5%) for 10X when the first 100 results were used to calculate the mean±2 SD. By contrast, the percentage of data sets with >20% failing the RiliBÄK control limits was 25 (24.3%). Only two data sets (1.9%) had more than 20% of results outside the QConnect Limits. Conclusions: The rate of failure of QCs using QConnect Limits was more applicable for monitoring infectious disease serology testing compared with UK Public Health, CLSI and RiliBÄK, as the alternatives to QConnect Limits reported an unacceptably high percentage of failures across the 103 data sets.
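As a rough illustration of the mean±2 SD limits mentioned above, the following sketch derives limits from the first results of a QC series and reports the percentage of results falling outside them. It is a simplified stand-in, not any of the published guidelines: the function names and the toy values are hypothetical, and it assumes every result in the series (including the baseline) is checked against the limits.

```python
from statistics import mean, stdev


def control_limits(results, n_baseline=20, k=2.0):
    """Return (low, high) limits as mean ± k sample SDs of the first
    n_baseline results."""
    baseline = results[:n_baseline]
    m = mean(baseline)
    s = stdev(baseline)
    return m - k * s, m + k * s


def percent_outside(results, limits):
    """Percentage of results that fall outside the (low, high) limits."""
    low, high = limits
    failures = sum(1 for r in results if r < low or r > high)
    return 100.0 * failures / len(results)


# Hypothetical QC series: 20 baseline results followed by 5 later results.
qc = [0.9, 1.1] * 10 + [1.0, 1.3, 0.7, 1.05, 1.25]
limits = control_limits(qc, n_baseline=20)
print(limits, percent_outside(qc, limits))
```

A data set would count toward the ">20% failing" group in the abstract only if this percentage exceeded 20; the toy series stays below that level.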


2019 ◽  
Author(s):  
Connor Chato ◽  
Marcia L. Kalish ◽  
Art F. Y. Poon

Abstract Genetic clustering is a popular method for characterizing variation in transmission rates for rapidly evolving viruses, and could potentially be used to detect outbreaks in ‘near real time’. However, the statistical properties of clustering are poorly understood in this context, and there are no objective guidelines for setting clustering criteria. Here we develop a new statistical framework to optimize a genetic clustering method based on the ability to forecast new cases. We analyzed the pairwise Tamura-Nei (TN93) genetic distances for anonymized HIV-1 subtype B pol sequences from Seattle (n = 1,653) and Middle Tennessee, USA (n = 2,779), and northern Alberta, Canada (n = 809). Under varying TN93 thresholds, we fit two models to the distributions of new cases relative to clusters of known cases: (1) a null model that assumes cluster growth is strictly proportional to cluster size, i.e., no variation in transmission rates among individuals; and (2) a weighted model that incorporates individual-level covariates, such as recency of diagnosis. The optimal threshold maximizes the difference in information loss between models, where covariates are used most effectively. Optimal TN93 thresholds varied substantially between data sets, e.g., 0.0104 in Alberta and 0.016 in Seattle and Tennessee, such that the optimum for one population will potentially misdirect prevention efforts in another. The range of thresholds where the weighted model conferred greater predictive accuracy tended to be narrow (±0.005 units), but the optimal threshold for a given population also tended to be stable over time. We also extended our method to demonstrate that variation in recency of HIV diagnosis among clusters was significantly more predictive of new cases than sample collection dates (ΔAIC > 50). These results demonstrate that one cannot rely on historical precedence or convention to configure genetic clustering methods for public health applications.
Our framework not only provides an objective procedure to optimize a clustering method, but can also be used for variable selection in forecasting new cases.


10.29007/38sk ◽  
2019 ◽  
Author(s):  
Yan Yan ◽  
Tin Nguyen ◽  
Bobby Bryant ◽  
Frederick C. Harris

Noise remains a particularly challenging and ubiquitous problem in cancer gene expression data clustering research; it can produce inaccurate results and obscure the underlying biological meaning. A clustering method that is robust to noise is highly desirable. No single clustering method performs best across all data sets, despite the vast number of methods available. A cluster ensemble provides a way to automatically combine results from multiple clustering methods to improve robustness and accuracy. We propose a novel noise-robust fuzzy cluster ensemble algorithm. It employs an improved fuzzy clustering approach with different initializations as its base clusterings to avoid or alleviate the effects of noise in data sets. Its results show effective improvements on most of the examined noisy real cancer gene expression data sets when compared with most of the evaluated benchmark clustering methods: it is the top performer on three of the eight data sets, more than any other method evaluated, and it performs well on most of the other data sets. Our fuzzy cluster ensemble is also robust on highly noisy synthetic data sets. Moreover, it is computationally efficient.


Author(s):  
Eric U.O. ◽  
Michael O.O. ◽  
Oberhiri-Orumah G. ◽  
Chike H. N.

Cluster analysis is an unsupervised learning method that classifies data points, usually multidimensional, into groups (called clusters) such that members of one cluster are more similar (in some sense) to each other than to those in other clusters. In this paper, we propose a new k-means clustering method that uses Minkowski’s distance as its metric in a normed vector space; this distance generalizes both the Euclidean distance and the Manhattan distance. The k-means clustering methods discussed in this paper are Forgy’s method, Lloyd’s method, MacQueen’s method, Hartigan and Wong’s method, Likas’ method and Faber’s method, which use the usual Euclidean distance. It was observed that the new k-means clustering method performed favourably in comparison with the existing methods in terms of minimization of the total intra-cluster variance using simulated data and real-life data sets.
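To make the idea of k-means with a Minkowski metric concrete, here is a minimal sketch of Lloyd-style iterations in which points are assigned to the nearest center under the Minkowski distance of order p (p = 1 gives Manhattan, p = 2 gives Euclidean). It is not the authors' algorithm: the function names are hypothetical, the initial centers are supplied by the caller, and the centers are updated with the coordinate-wise mean, which is an assumption that exactly minimizes within-cluster cost only for p = 2.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def kmeans_minkowski(points, centers, p=2.0, max_iter=100):
    """Lloyd-style k-means using the Minkowski distance for assignment.

    Assumption: centers are updated as coordinate-wise means.
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center under the Minkowski distance.
        labels = [
            min(range(len(centers)), key=lambda k: minkowski(pt, centers[k], p))
            for pt in points
        ]
        # Update step: mean of the members; keep an empty cluster's center.
        new_centers = []
        for k in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical toy data: two well-separated groups in the plane.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for order in (1, 2, 3):
    print(order, kmeans_minkowski(pts, [(0, 0), (10, 10)], p=order))
```

On data this well separated, every order p yields the same partition; differences between metrics would only show up on less clear-cut data.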


2021 ◽  
Vol 257 ◽  
pp. 01032
Author(s):  
Dong Hong Huang ◽  
Dan Liu ◽  
Ming Wen ◽  
Xin Li Dong ◽  
Min Wen ◽  
...  

For the design and planning of gas-fired boiler systems, the boiler load is important basic data. Load clustering analysis, which combines data mining technology with gas boiler systems, uncovers the load patterns hidden in a large number of disordered and irregular loads and classifies them, helping to solve many problems in gas boiler systems. Current load clustering methods all have shortcomings to some degree. The proposed method first applies PVA dimension reduction to the large volume of gas data and then carries out cluster analysis. In practical applications of gas-fired boilers, the data sets encountered are usually unbalanced. To solve the problem of sample imbalance, we use the FCM-SMOTE algorithm to oversample the clustered data so that the data set becomes balanced.
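The oversampling step can be sketched with the basic SMOTE interpolation: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This covers only plain SMOTE; the fuzzy c-means (FCM) stage of FCM-SMOTE is not described in the abstract and is omitted here. The function name `smote`, its parameters and the toy points are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Candidate neighbours: all other minority points, nearest first.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical unbalanced toy set: 6 majority points, 3 minority points.
majority = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.9), (0.8, 0.7), (0.3, 0.4), (0.6, 0.1)]
minority = [(5.0, 5.0), (5.5, 5.2), (5.1, 5.8)]

new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.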

