One and Two Dimensional Data Analysis Using Bezier Functions

Author(s):  
P. Venkataraman

Bezier functions, which are Bezier curves constrained to behave like functions, are excellent for representing smooth and continuous functions of high degree over the entire range of the independent variable. They provide excellent solutions to systems of linear, nonlinear, ordinary, and partial differential equations. In this paper we examine their usefulness for data approximation as a prelude to their use in solving the inverse problem. Bezier functions and the related B-splines have mostly been used in geometric modeling; there are few examples of their use in data analysis. In this paper, organized and unorganized data are used to illustrate the effectiveness of Bezier functions for data approximation, reduction, mining, transformation, and prediction. Two criteria drive the overall data-fitting process. A simple incremental strategy identifies the order of the function using the minimum of the sum of the absolute error over all data. For a given order of the function, the least squared error over all data identifies the Bezier function through a non-iterative algebraic relation. The entire data set can then be represented by the coefficients of the Bezier function. Alternatively, the data can be reduced to a polynomial in a parameter varying between 0 and 1. The Bezier function is global over all of the data, so all data points, including interpolated points, have the same properties. Three important properties are explicit in using Bezier functions for data analysis: the mean of the original data and the approximated data are the same; polynomials of large order can be used without local distortion; and the independent and dependent variables are decoupled by the Bezier representation. The fitting process can also filter noisy data to recover the principal data behavior.
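The fitting scheme described above, a non-iterative least-squares solve at each candidate order with the order chosen by the minimum sum of absolute errors, can be sketched in Python with NumPy. This is an illustrative sketch, not the author's code; the function names and test data are made up:

```python
import numpy as np
from math import comb

def bernstein_matrix(t, n):
    # B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i), evaluated at each parameter value
    return np.array([[comb(n, i) * ti**i * (1 - ti)**(n - i)
                      for i in range(n + 1)] for ti in t])

def fit_bezier(x, y, max_order=10):
    # Decouple the variables: map x linearly onto the parameter t in [0, 1]
    t = (x - x.min()) / (x.max() - x.min())
    best = None
    for n in range(1, max_order + 1):
        B = bernstein_matrix(t, n)
        # Non-iterative algebraic solve: least squared error over all data
        coeffs, *_ = np.linalg.lstsq(B, y, rcond=None)
        abs_err = np.abs(B @ coeffs - y).sum()   # order-selection criterion
        if best is None or abs_err < best[0]:
            best = (abs_err, n, coeffs)
    return best  # (sum of absolute error, order, control-point ordinates)

# Noisy samples of a smooth function
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
y = np.sin(2.0 * x) + 0.01 * rng.standard_normal(50)
err, order, coeffs = fit_bezier(x, y)
yhat = bernstein_matrix((x - x.min()) / (x.max() - x.min()), order) @ coeffs
```

Because the Bernstein basis sums to one, the constant vector lies in the span of the fit, so the least-squares residual is orthogonal to it; this is the reason the mean of the approximation equals the mean of the data, as the abstract notes.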

2005 ◽  
Author(s):  
Nicholas J. Tustison ◽  
James Gee

Since the 1970s, B-splines have evolved to become the {} standard for curve and surface representation owing to many of their salient properties. Conventional least-squares scattered-data fitting techniques for B-splines require the inversion of potentially large matrices, which is time-consuming as well as susceptible to ill-conditioning that leads to undesired results. Lee {} proposed a novel B-spline algorithm for fitting a 2-D cubic B-spline surface to scattered data. The proposed algorithm uses an optional multilevel approach for better fitting results. We generalize this technique to support N-dimensional data fitting as well as B-splines of arbitrary degree. In addition, we generalize the B-spline kernel function class to accommodate this new image filter.
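For the 1-D case, the conventional least-squares fit that the authors contrast with Lee's multilevel scheme is available directly in SciPy. A minimal sketch, with illustrative knot placement and synthetic scattered data:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 200))          # scattered 1-D sample sites
y = np.cos(2 * np.pi * x) + 0.05 * rng.standard_normal(200)

k = 3                                            # cubic B-spline
interior = np.linspace(0.0, 1.0, 10)[1:-1]       # 8 interior knots
# Boundary knots repeated k + 1 times, as the B-spline basis requires
t = np.r_[[0.0] * (k + 1), interior, [1.0] * (k + 1)]

# Solves the (potentially large, possibly ill-conditioned) least-squares system
spl = make_lsq_spline(x, y, t, k=k)
yhat = spl(x)
```

Refining the knot grid level by level, rather than solving one large system, is the essence of the multilevel approach the abstract generalizes.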


2020 ◽  
Author(s):  
Lucian Chan ◽  
Garrett Morris ◽  
Geoffrey Hutchison

The calculation of the entropy of flexible molecules can be challenging, since the number of possible conformers grows exponentially with molecule size and many low-energy conformers may be thermally accessible. Different methods have been proposed to approximate the contribution of conformational entropy to the molecular standard entropy, including performing thermochemistry calculations with all possible stable conformations, and developing empirical corrections from experimental data. We have performed conformer sampling on over 120,000 small molecules generating some 12 million conformers, to develop models to predict conformational entropy across a wide range of molecules. Using insight into the nature of conformational disorder, our cross-validated physically-motivated statistical model can outperform common machine learning and deep learning methods, with a mean absolute error ≈4.8 J/(mol·K), or under 0.4 kcal/mol at 300 K. Beyond predicting molecular entropies and free energies, the model implies a high degree of correlation between torsions in most molecules, often assumed to be independent. While individual dihedral rotations may have low energetic barriers, the shape and chemical functionality of most molecules necessarily correlate their torsional degrees of freedom, and hence restrict the number of low-energy conformations immensely. Our simple models capture these correlations, and advance our understanding of small molecule conformational entropy.
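The underlying quantity is the Gibbs entropy over Boltzmann-weighted conformer populations. A minimal sketch, using made-up relative conformer energies rather than the paper's data:

```python
import math

R = 8.314      # gas constant, J/(mol*K)
T = 300.0      # temperature, K

def conformational_entropy(rel_energies_kj):
    # Boltzmann populations p_i from relative conformer energies (kJ/mol),
    # then S_conf = -R * sum(p_i * ln p_i)
    beta = 1000.0 / (R * T)                      # per kJ/mol
    w = [math.exp(-e * beta) for e in rel_energies_kj]
    Z = sum(w)
    p = [wi / Z for wi in w]
    return -R * sum(pi * math.log(pi) for pi in p)  # J/(mol*K)

# A global minimum plus two conformers 2 kJ/mol higher
S = conformational_entropy([0.0, 2.0, 2.0])
```

If torsions were truly independent, per-bond entropies like this would simply add across rotatable bonds; the torsional correlations the model captures reduce that sum, which is why the naive independent-torsion estimate overshoots.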


Author(s):  
Ying Wang ◽  
Yiding Liu ◽  
Minna Xia

Big data is characterized by multiple sources and heterogeneity. In this study, a hybrid forest fire analysis system is built on the big data platform of Hadoop and Spark. The platform combines big data analysis and processing technology and draws on research results from different technical fields, such as forest fire monitoring. In this system, HDFS (from Hadoop) is used to store all kinds of data, the Spark module provides various big data analysis methods, and visualization tools such as ECharts, ArcGIS, and Unity3D render the analysis results. Finally, an experiment on forest fire point detection is designed to corroborate the feasibility and effectiveness of the system and to provide guidance for follow-up research and for establishing a big data platform for forest fire monitoring and visualized early warning. However, the experiment has two shortcomings: more data types should be selected, and compatibility would be better if the original data were converted to XML format. It is expected that these problems will be solved in follow-up research.


2020 ◽  
pp. 000370282097751
Author(s):  
Xin Wang ◽  
Xia Chen

Many spectra have a polynomial-like baseline. Iterative polynomial fitting (IPF) is one of the most popular methods for baseline correction of these spectra. However, the baseline estimated by IPF may have substantial error when the spectrum contains significantly strong peaks or has strong peaks located at its endpoints. First, IPF uses a temporary baseline estimated from the current spectrum to identify peak data points. If the current spectrum contains strong peaks, then the temporary baseline deviates substantially from the true baseline; some good baseline data points may be mistakenly identified as peak points and artificially reassigned a low value. Second, if a strong peak is located at an endpoint of the spectrum, then the endpoint region of the estimated baseline may have significant error due to overfitting. This study proposes a search-algorithm-based baseline correction method (SA) that compresses the raw spectrum to a dataset with a small number of data points and then converts peak removal into a search problem in artificial intelligence (AI): minimizing an objective function by deleting peak data points. First, the raw spectrum is smoothed by the moving-average method to reduce noise and then divided into dozens of unequally spaced sections on the basis of Chebyshev nodes. The minimum point of each section is then collected to form the dataset for peak removal by the search algorithm. SA uses the mean absolute error (MAE) as the objective function because of its sensitivity to overfitting and its rapid calculation. The baseline correction performance of SA is compared with that of three other baseline correction methods: the Lieber and Mahadevan-Jansen method, the adaptive iteratively reweighted penalized least squares method, and the improved asymmetric least squares method. Simulated and real FTIR and Raman spectra with polynomial-like baselines are employed in the experiments. Results show that for these spectra, the baseline estimated by SA has smaller error than those estimated by the three other methods.
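The compression step, smoothing, dividing at Chebyshev nodes, and keeping each section's minimum as a baseline candidate, can be sketched as follows. This is a simplified illustration on a synthetic spectrum, not the authors' implementation, and the final search over the candidates is omitted:

```python
import numpy as np

def baseline_candidates(spectrum, n_sections=30, window=5):
    # 1. Moving-average smoothing to reduce noise
    smooth = np.convolve(spectrum, np.ones(window) / window, mode='same')

    # 2. Unequally spaced section edges from Chebyshev nodes
    #    (sections are narrower near the endpoints)
    n = len(smooth)
    k = np.arange(n_sections + 1)
    edges = np.unique(((n - 1) * 0.5 * (1 - np.cos(np.pi * k / n_sections))).astype(int))

    # 3. The minimum point of each section is a baseline candidate
    idx = [lo + int(np.argmin(smooth[lo:hi + 1]))
           for lo, hi in zip(edges[:-1], edges[1:])]
    return np.array(sorted(set(idx)))

# Synthetic spectrum: quadratic baseline plus two Gaussian peaks
x = np.linspace(0.0, 1.0, 500)
true_baseline = 2.0 + 3.0 * x - 2.0 * x**2
spec = (true_baseline
        + 5.0 * np.exp(-((x - 0.3) / 0.02)**2)
        + 8.0 * np.exp(-((x - 0.7) / 0.03)**2))
cand = baseline_candidates(spec)
dev = spec[cand] - true_baseline[cand]   # candidates should hug the baseline
```

Because section minima rarely fall on a peak, most candidates lie on or near the true baseline, leaving the search algorithm only a few peak points to delete.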


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 431
Author(s):  
Mike L. Smith ◽  
Andrzej K. Oleś ◽  
Wolfgang Huber

The Bioconductor Gateway on the F1000Research platform is a channel for peer-reviewed and citable publication of end-to-end data analysis workflows rooted in the Bioconductor ecosystem. In addition to the largely static journal publication, it is hoped that authors will also deposit their workflows as executable documents on Bioconductor, where the benefits of regular code testing and easy updating can be realized. Ideally these two endpoints would be produced from a single source document. However, so far this has not been easy, owing to the lack of a technical solution that meets both the requirements of the F1000Research article submission format and those of the executable documents on Bioconductor. Submission to the platform requires a LaTeX file, which many authors have traditionally produced by writing an Rnw document for Sweave or knitr. On the other hand, to produce the HTML rendering of the document hosted by Bioconductor, the most straightforward starting point is the R Markdown format. Tools such as pandoc enable conversion between many formats, but a high degree of manual intervention has typically been required to satisfactorily handle aspects such as floating figures, cross-references, literature references, and author affiliations. The BiocWorkflowTools package aims to solve this problem by enabling authors to work with R Markdown right up until the moment they wish to submit to the platform.


2021 ◽  
Author(s):  
Lucian Chan ◽  
Garrett Morris ◽  
Geoffrey Hutchison

The calculation of the entropy of flexible molecules can be challenging, since the number of possible conformers grows exponentially with molecule size and many low-energy conformers may be thermally accessible. Different methods have been proposed to approximate the contribution of conformational entropy to the molecular standard entropy, including performing thermochemistry calculations with all possible stable conformations, and developing empirical corrections from experimental data. We have performed conformer sampling on over 120,000 small molecules generating some 12 million conformers, to develop models to predict conformational entropy across a wide range of molecules. Using insight into the nature of conformational disorder, our cross-validated physically-motivated statistical model can outperform common machine learning and deep learning methods, with a mean absolute error ≈4.8 J/(mol·K), or under 0.4 kcal/mol at 300 K. Beyond predicting molecular entropies and free energies, the model implies a high degree of correlation between torsions in most molecules, often assumed to be independent. While individual dihedral rotations may have low energetic barriers, the shape and chemical functionality of most molecules necessarily correlate their torsional degrees of freedom, and hence restrict the number of low-energy conformations immensely. Our simple models capture these correlations, and advance our understanding of small molecule conformational entropy.


2010 ◽  
pp. 1797-1803
Author(s):  
Lisa Friedland

In traditional data analysis, data points lie in a Cartesian space, and an analyst asks certain questions: (1) What distribution can I fit to the data? (2) Which points are outliers? (3) Are there distinct clusters or substructure? Today, data mining treats richer and richer types of data. Social networks encode information about people and their communities; relational data sets incorporate multiple types of entities and links; and temporal information describes the dynamics of these systems. With such semantically complex data sets, a greater variety of patterns can be described and richer views of the data can be constructed. This article describes a specific social structure that may be present in such data sources and presents a framework for detecting it. The goal is to identify tribes, or small groups of individuals who intentionally coordinate their behavior—individuals with enough in common that they are unlikely to be acting independently. While this task can only be conceived of in a domain of interacting entities, the solution techniques return to the traditional data analysis questions. In order to find hidden structure (3), we use an anomaly detection approach: develop a model to describe the data (1), then identify outliers (2).
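The three-step recipe, fit a model (1), flag outliers (2), and read hidden structure (3) from them, can be illustrated with the simplest possible stand-in: a Gaussian model with a z-score outlier test. The data here are synthetic; real tribe detection operates on relational features rather than a single score:

```python
import random
import statistics

random.seed(7)
# Background: independent individuals, modeled here by a single score
background = [random.gauss(0.0, 1.0) for _ in range(1000)]
# A small "tribe": coordinated individuals with similar, extreme scores
tribe = [5.9, 6.0, 6.2]
data = background + tribe

# (1) Fit a distribution to describe the data
mu = statistics.fmean(data)
sigma = statistics.stdev(data)

# (2) Identify outliers: more than 3 standard deviations from the mean
outliers = [v for v in data if abs(v - mu) / sigma > 3.0]
```

(3) Any tight cluster among the flagged outliers, here the three scores near 6, is the candidate hidden substructure.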


Author(s):  
Diane J. Cook ◽  
Lawrence B. Holder

The large amount of data collected today is quickly overwhelming researchers’ abilities to interpret the data and discover interesting patterns. In response to this problem, a number of researchers have developed techniques for discovering concepts in databases. These techniques work well for data expressed in a nonstructural, attribute-value representation and address issues of data relevance, missing data, noise and uncertainty, and utilization of domain knowledge (Fisher, 1987; Cheeseman and Stutz, 1996). However, recent data acquisition projects are collecting structural data describing the relationships among the data objects. Correspondingly, there exists a need for techniques to analyze and discover concepts in structural databases (Fayyad et al., 1996b). One method for discovering knowledge in structural data is the identification of common substructures. The goal is to find substructures capable of compressing the data and to identify conceptually interesting substructures that enhance the interpretation of the data. Substructure discovery is the process of identifying concepts describing interesting and repetitive substructures within structural data. Once discovered, the substructure concept can be used to simplify the data by replacing instances of the substructure with a pointer to the newly discovered concept. The discovered substructure concepts allow abstraction over detailed structure in the original data and provide new, relevant attributes for interpreting the data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the goals of the data analysis. We describe a system called Subdue that discovers interesting substructures in structural data based on the minimum description length (MDL) principle. 
The Subdue system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously discovered substructures, multiple passes of Subdue produce a hierarchical description of the structural regularities in the data. Subdue uses a computationally bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints.
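A toy version of the MDL search, run over strings rather than graphs, conveys the idea. Real Subdue works on labeled graphs with inexact matching; here the '#' pointer symbol is assumed not to occur in the data:

```python
def best_substructure(s, min_len=2, max_len=10):
    # Score each candidate pattern by total description length:
    # DL(pattern definition) + DL(data with instances replaced by a pointer)
    best_dl, best_pat = len(s), None
    for length in range(min_len, max_len + 1):
        for i in range(len(s) - length + 1):
            pat = s[i:i + length]
            compressed = s.replace(pat, '#')   # '#' points to the new concept
            dl = len(pat) + len(compressed)
            if dl < best_dl:
                best_dl, best_pat = dl, pat
    return best_dl, best_pat

data = "abcXabcYabcZabc"
dl, pattern = best_substructure(data)   # "abc" compresses the data best
```

Iterating, that is, rerunning the search on the compressed string, yields the hierarchical description of the data mentioned above.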


2012 ◽  
Vol 106 (9) ◽  
pp. 543-554 ◽  
Author(s):  
Derrick W. Smith ◽  
Sinikka M. Smothers

Introduction: The purpose of the study presented here was to determine how well tactile graphics (specifically, data analysis graphs) in secondary mathematics and science braille textbooks correlated with the print graphics.

Method: A content analysis was conducted on 598 separate data analysis graphics from 10 mathematics and science textbooks. The researchers (the authors) cross-validated the findings through a comparative analysis of the tactile graphics of five shared textbooks.

Results: Discrepancies were found between the print graphic and the tactile graphic in 12.5% of the sample. The most common discrepancy was differences in how data lines and data points were individualized in the print graphic compared to the tactile graphic. On the basis of the reviews of the graphics, the researchers answered a 5-point Likert-scale question (from 1 = strongly disagree to 5 = strongly agree) asking whether the "tactile graphic is a valid representation of the print graphic." The overall score for the sample was 3.71 (SD = 1.60), with a Krippendorff alpha of 0.6328 (a measure of agreement; values of alpha > 0.70 are considered moderate).

Discussion: The findings demonstrate that while the majority of tactile graphics correlate well with their print counterparts, there is still room for improvement. Some transcribers omitted a tactile graphic without providing a reason. Forty graphics (6.7%) were omitted from the braille transcription, and two textbooks were missing more than 85% of the tactile graphics of the data graphs.

Implications for practitioners: Tactile graphics in mathematics and science books are important for a student to understand. Although most transcribers do an excellent job of creating valid tactile graphics, problems with many graphics still exist in textbooks. Practitioners need to constantly review the tactile graphics used in all classrooms and be prepared to create their own if needed.

