One and Two Dimensional Data Analysis Using Bezier Functions

Author(s):  
P. Venkataraman

Bezier functions, which are Bezier curves constrained to behave like functions, are excellent for representing smooth and continuous functions of high degree over the entire range of the independent variable. They provide excellent solutions to systems of linear, nonlinear, ordinary, and partial differential equations. In this paper we examine their usefulness for data approximation as a prelude to their use in solving the inverse problem. Bezier functions and the related B-splines have mostly been used in geometric modeling; there are few examples of their use in data analysis. In this paper, organized and unorganized data are used to illustrate the effectiveness of Bezier functions for data approximation, reduction, mining, transformation, and prediction. Two criteria drive the overall data-fitting process. A simple incremental strategy identifies the order of the function using the minimum of the sum of the absolute error over all data. For a given order of the function, the least squared error over all data identifies the Bezier function through a non-iterative algebraic relation. The entire data set can then be represented by the coefficients of the Bezier function. Alternatively, the data can be reduced to a polynomial in a parameter varying between 0 and 1. The Bezier function is global over all of the data, so all data points, including interpolated points, have the same properties. Three important properties are explicit in using Bezier functions for data analysis: the mean of the original data and the approximated data are the same; polynomials of large order can be used without local distortion; and the independent and dependent variables are decoupled by the Bezier representation. The fitting process can also filter noisy data to recover the principal data behavior.
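The fitting scheme described above, a non-iterative least-squares solve at each candidate order with the order chosen by the minimum sum of absolute errors, can be sketched in Python with NumPy. This is an illustrative sketch, not the author's code; the function names and test data are made up:

```python
import numpy as np
from math import comb

def bernstein_matrix(t, n):
    # B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i), evaluated at each parameter value
    return np.array([[comb(n, i) * ti**i * (1 - ti)**(n - i)
                      for i in range(n + 1)] for ti in t])

def fit_bezier(x, y, max_order=10):
    # Decouple the variables: map x linearly onto the parameter t in [0, 1]
    t = (x - x.min()) / (x.max() - x.min())
    best = None
    for n in range(1, max_order + 1):
        B = bernstein_matrix(t, n)
        # Non-iterative algebraic solve: least squared error over all data
        coeffs, *_ = np.linalg.lstsq(B, y, rcond=None)
        abs_err = np.abs(B @ coeffs - y).sum()   # order-selection criterion
        if best is None or abs_err < best[0]:
            best = (abs_err, n, coeffs)
    return best  # (sum of absolute error, order, control-point ordinates)

# Noisy samples of a smooth function
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
y = np.sin(2.0 * x) + 0.01 * rng.standard_normal(50)
err, order, coeffs = fit_bezier(x, y)
yhat = bernstein_matrix((x - x.min()) / (x.max() - x.min()), order) @ coeffs
```

Because the Bernstein basis sums to one, the constant vector lies in the span of the fit, so the least-squares residual is orthogonal to it; this is the reason the mean of the approximation equals the mean of the data, as the abstract notes.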

2005 ◽  
Author(s):  
Nicholas J. Tustison ◽  
James Gee

Since the 1970s, B-splines have evolved to become the {} standard for curve and surface representation owing to many of their salient properties. Conventional least-squares scattered-data fitting techniques for B-splines require the inversion of potentially large matrices, which is time-consuming as well as susceptible to ill-conditioning that leads to undesired results. Lee {} proposed a novel B-spline algorithm for fitting a 2-D cubic B-spline surface to scattered data. The proposed algorithm uses an optional multilevel approach for better fitting results. We generalize this technique to support N-dimensional data fitting as well as B-splines of arbitrary degree. In addition, we generalize the B-spline kernel function class to accommodate this new image filter.
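For the 1-D case, the conventional least-squares fit that the authors contrast with Lee's multilevel scheme is available directly in SciPy. A minimal sketch, with illustrative knot placement and synthetic scattered data:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 200))          # scattered 1-D sample sites
y = np.cos(2 * np.pi * x) + 0.05 * rng.standard_normal(200)

k = 3                                            # cubic B-spline
interior = np.linspace(0.0, 1.0, 10)[1:-1]       # 8 interior knots
# Boundary knots repeated k + 1 times, as the B-spline basis requires
t = np.r_[[0.0] * (k + 1), interior, [1.0] * (k + 1)]

# Solves the (potentially large, possibly ill-conditioned) least-squares system
spl = make_lsq_spline(x, y, t, k=k)
yhat = spl(x)
```

Refining the knot grid level by level, rather than solving one large system, is the essence of the multilevel approach the abstract generalizes.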


2020 ◽  
Author(s):  
Lucian Chan ◽  
Garrett Morris ◽  
Geoffrey Hutchison

The calculation of the entropy of flexible molecules can be challenging, since the number of possible conformers grows exponentially with molecule size and many low-energy conformers may be thermally accessible. Different methods have been proposed to approximate the contribution of conformational entropy to the molecular standard entropy, including performing thermochemistry calculations with all possible stable conformations, and developing empirical corrections from experimental data. We have performed conformer sampling on over 120,000 small molecules generating some 12 million conformers, to develop models to predict conformational entropy across a wide range of molecules. Using insight into the nature of conformational disorder, our cross-validated physically-motivated statistical model can outperform common machine learning and deep learning methods, with a mean absolute error ≈4.8 J/(mol·K), or under 0.4 kcal/mol at 300 K. Beyond predicting molecular entropies and free energies, the model implies a high degree of correlation between torsions in most molecules, often assumed to be independent. While individual dihedral rotations may have low energetic barriers, the shape and chemical functionality of most molecules necessarily correlate their torsional degrees of freedom, and hence restrict the number of low-energy conformations immensely. Our simple models capture these correlations, and advance our understanding of small molecule conformational entropy.
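The underlying quantity is the Gibbs entropy over Boltzmann-weighted conformer populations. A minimal sketch, using made-up relative conformer energies rather than the paper's data:

```python
import math

R = 8.314      # gas constant, J/(mol*K)
T = 300.0      # temperature, K

def conformational_entropy(rel_energies_kj):
    # Boltzmann populations p_i from relative conformer energies (kJ/mol),
    # then S_conf = -R * sum(p_i * ln p_i)
    beta = 1000.0 / (R * T)                      # per kJ/mol
    w = [math.exp(-e * beta) for e in rel_energies_kj]
    Z = sum(w)
    p = [wi / Z for wi in w]
    return -R * sum(pi * math.log(pi) for pi in p)  # J/(mol*K)

# A global minimum plus two conformers 2 kJ/mol higher
S = conformational_entropy([0.0, 2.0, 2.0])
```

If torsions were truly independent, per-bond entropies like this would simply add across rotatable bonds; the torsional correlations the model captures reduce that sum, which is why the naive independent-torsion estimate overshoots.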


Author(s):  
Ying Wang ◽  
Yiding Liu ◽  
Minna Xia

Big data is characterized by multiple sources and heterogeneity. In this study, a hybrid forest fire analysis system is built on the big data platform of Hadoop and Spark. The platform combines big data analysis and processing technology and draws on research results from different technical fields, such as forest fire monitoring. In this system, HDFS (from Hadoop) is used to store all kinds of data, the Spark module provides various big data analysis methods, and visualization tools such as ECharts, ArcGIS, and Unity3D render the analysis results. Finally, an experiment on forest fire point detection is designed to corroborate the feasibility and effectiveness of the system and to provide guidance for follow-up research and for establishing a big data platform for forest fire monitoring and visualized early warning. However, the experiment has two shortcomings: more data types should be selected, and compatibility would be better if the original data were converted to XML format. It is expected that these problems will be solved in follow-up research.


2020 ◽  
pp. 000370282097751
Author(s):  
Xin Wang ◽  
Xia Chen

Many spectra have a polynomial-like baseline. Iterative polynomial fitting (IPF) is one of the most popular methods for baseline correction of these spectra. However, the baseline estimated by IPF may have substantial error when the spectrum contains significantly strong peaks or has strong peaks located at its endpoints. First, IPF uses a temporary baseline estimated from the current spectrum to identify peak data points. If the current spectrum contains strong peaks, then the temporary baseline deviates substantially from the true baseline; some good baseline data points may be mistakenly identified as peak points and artificially reassigned a low value. Second, if a strong peak is located at an endpoint of the spectrum, then the endpoint region of the estimated baseline may have significant error due to overfitting. This study proposes a search-algorithm-based baseline correction method (SA) that compresses the raw spectrum to a dataset with a small number of data points and then converts peak removal into a search problem in artificial intelligence (AI): minimizing an objective function by deleting peak data points. First, the raw spectrum is smoothed by the moving-average method to reduce noise and then divided into dozens of unequally spaced sections on the basis of Chebyshev nodes. The minimum point of each section is then collected to form the dataset for peak removal by the search algorithm. SA uses the mean absolute error (MAE) as the objective function because of its sensitivity to overfitting and its rapid calculation. The baseline correction performance of SA is compared with that of three other baseline correction methods: the Lieber and Mahadevan-Jansen method, the adaptive iteratively reweighted penalized least squares method, and the improved asymmetric least squares method. Simulated and real FTIR and Raman spectra with polynomial-like baselines are employed in the experiments. Results show that for these spectra, the baseline estimated by SA has smaller error than those estimated by the three other methods.
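The compression step, smoothing, dividing at Chebyshev nodes, and keeping each section's minimum as a baseline candidate, can be sketched as follows. This is a simplified illustration on a synthetic spectrum, not the authors' implementation, and the final search over the candidates is omitted:

```python
import numpy as np

def baseline_candidates(spectrum, n_sections=30, window=5):
    # 1. Moving-average smoothing to reduce noise
    smooth = np.convolve(spectrum, np.ones(window) / window, mode='same')

    # 2. Unequally spaced section edges from Chebyshev nodes
    #    (sections are narrower near the endpoints)
    n = len(smooth)
    k = np.arange(n_sections + 1)
    edges = np.unique(((n - 1) * 0.5 * (1 - np.cos(np.pi * k / n_sections))).astype(int))

    # 3. The minimum point of each section is a baseline candidate
    idx = [lo + int(np.argmin(smooth[lo:hi + 1]))
           for lo, hi in zip(edges[:-1], edges[1:])]
    return np.array(sorted(set(idx)))

# Synthetic spectrum: quadratic baseline plus two Gaussian peaks
x = np.linspace(0.0, 1.0, 500)
true_baseline = 2.0 + 3.0 * x - 2.0 * x**2
spec = (true_baseline
        + 5.0 * np.exp(-((x - 0.3) / 0.02)**2)
        + 8.0 * np.exp(-((x - 0.7) / 0.03)**2))
cand = baseline_candidates(spec)
dev = spec[cand] - true_baseline[cand]   # candidates should hug the baseline
```

Because section minima rarely fall on a peak, most candidates lie on or near the true baseline, leaving the search algorithm only a few peak points to delete.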


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 431
Author(s):  
Mike L. Smith ◽  
Andrzej K. Oleś ◽  
Wolfgang Huber

The Bioconductor Gateway on the F1000Research platform is a channel for peer-reviewed and citable publication of end-to-end data analysis workflows rooted in the Bioconductor ecosystem. In addition to the largely static journal publication, it is hoped that authors will also deposit their workflows as executable documents on Bioconductor, where the benefits of regular code testing and easy updating can be realized. Ideally these two endpoints would be produced from a single source document. However, so far this has not been easy, owing to the lack of a technical solution that meets both the requirements of the F1000Research article submission format and those of the executable documents on Bioconductor. Submission to the platform requires a LaTeX file, which many authors have traditionally produced by writing an Rnw document for Sweave or knitr. On the other hand, to produce the HTML rendering of the document hosted by Bioconductor, the most straightforward starting point is the R Markdown format. Tools such as pandoc enable conversion between many formats, but a high degree of manual intervention has typically been required to satisfactorily handle aspects such as floating figures, cross-references, literature references, and author affiliations. The BiocWorkflowTools package aims to solve this problem by enabling authors to work with R Markdown right up until the moment they wish to submit to the platform.


2021 ◽  
Author(s):  
Lucian Chan ◽  
Garrett Morris ◽  
Geoffrey Hutchison

The calculation of the entropy of flexible molecules can be challenging, since the number of possible conformers grows exponentially with molecule size and many low-energy conformers may be thermally accessible. Different methods have been proposed to approximate the contribution of conformational entropy to the molecular standard entropy, including performing thermochemistry calculations with all possible stable conformations, and developing empirical corrections from experimental data. We have performed conformer sampling on over 120,000 small molecules generating some 12 million conformers, to develop models to predict conformational entropy across a wide range of molecules. Using insight into the nature of conformational disorder, our cross-validated physically-motivated statistical model can outperform common machine learning and deep learning methods, with a mean absolute error ≈4.8 J/(mol·K), or under 0.4 kcal/mol at 300 K. Beyond predicting molecular entropies and free energies, the model implies a high degree of correlation between torsions in most molecules, often assumed to be independent. While individual dihedral rotations may have low energetic barriers, the shape and chemical functionality of most molecules necessarily correlate their torsional degrees of freedom, and hence restrict the number of low-energy conformations immensely. Our simple models capture these correlations, and advance our understanding of small molecule conformational entropy.


2010 ◽  
pp. 1797-1803
Author(s):  
Lisa Friedland

In traditional data analysis, data points lie in a Cartesian space, and an analyst asks certain questions: (1) What distribution can I fit to the data? (2) Which points are outliers? (3) Are there distinct clusters or substructure? Today, data mining treats richer and richer types of data. Social networks encode information about people and their communities; relational data sets incorporate multiple types of entities and links; and temporal information describes the dynamics of these systems. With such semantically complex data sets, a greater variety of patterns can be described and richer views of the data can be constructed. This article describes a specific social structure that may be present in such data sources and presents a framework for detecting it. The goal is to identify tribes, or small groups of individuals who intentionally coordinate their behavior—individuals with enough in common that they are unlikely to be acting independently. While this task can only be conceived of in a domain of interacting entities, the solution techniques return to the traditional data analysis questions. In order to find hidden structure (3), we use an anomaly detection approach: develop a model to describe the data (1), then identify outliers (2).
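The three-step recipe, fit a model (1), flag outliers (2), and read hidden structure (3) from them, can be illustrated with the simplest possible stand-in: a Gaussian model with a z-score outlier test. The data here are synthetic; real tribe detection operates on relational features rather than a single score:

```python
import random
import statistics

random.seed(7)
# Background: independent individuals, modeled here by a single score
background = [random.gauss(0.0, 1.0) for _ in range(1000)]
# A small "tribe": coordinated individuals with similar, extreme scores
tribe = [5.9, 6.0, 6.2]
data = background + tribe

# (1) Fit a distribution to describe the data
mu = statistics.fmean(data)
sigma = statistics.stdev(data)

# (2) Identify outliers: more than 3 standard deviations from the mean
outliers = [v for v in data if abs(v - mu) / sigma > 3.0]
```

(3) Any tight cluster among the flagged outliers, here the three scores near 6, is the candidate hidden substructure.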


Author(s):  
Diane J. Cook ◽  
Lawrence B. Holder

The large amount of data collected today is quickly overwhelming researchers’ abilities to interpret the data and discover interesting patterns. In response to this problem, a number of researchers have developed techniques for discovering concepts in databases. These techniques work well for data expressed in a nonstructural, attribute-value representation and address issues of data relevance, missing data, noise and uncertainty, and utilization of domain knowledge (Fisher, 1987; Cheeseman and Stutz, 1996). However, recent data acquisition projects are collecting structural data describing the relationships among the data objects. Correspondingly, there exists a need for techniques to analyze and discover concepts in structural databases (Fayyad et al., 1996b). One method for discovering knowledge in structural data is the identification of common substructures. The goal is to find substructures capable of compressing the data and to identify conceptually interesting substructures that enhance the interpretation of the data. Substructure discovery is the process of identifying concepts describing interesting and repetitive substructures within structural data. Once discovered, the substructure concept can be used to simplify the data by replacing instances of the substructure with a pointer to the newly discovered concept. The discovered substructure concepts allow abstraction over detailed structure in the original data and provide new, relevant attributes for interpreting the data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the goals of the data analysis. We describe a system called Subdue that discovers interesting substructures in structural data based on the minimum description length (MDL) principle. 
The Subdue system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously discovered substructures, multiple passes of Subdue produce a hierarchical description of the structural regularities in the data. Subdue uses a computationally bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints.
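A toy version of the MDL search, run over strings rather than graphs, conveys the idea. Real Subdue works on labeled graphs with inexact matching; here the '#' pointer symbol is assumed not to occur in the data:

```python
def best_substructure(s, min_len=2, max_len=10):
    # Score each candidate pattern by total description length:
    # DL(pattern definition) + DL(data with instances replaced by a pointer)
    best_dl, best_pat = len(s), None
    for length in range(min_len, max_len + 1):
        for i in range(len(s) - length + 1):
            pat = s[i:i + length]
            compressed = s.replace(pat, '#')   # '#' points to the new concept
            dl = len(pat) + len(compressed)
            if dl < best_dl:
                best_dl, best_pat = dl, pat
    return best_dl, best_pat

data = "abcXabcYabcZabc"
dl, pattern = best_substructure(data)   # "abc" compresses the data best
```

Iterating, that is, rerunning the search on the compressed string, yields the hierarchical description of the data mentioned above.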


2012 ◽  
Vol 106 (9) ◽  
pp. 543-554 ◽  
Author(s):  
Derrick W. Smith ◽  
Sinikka M. Smothers

Introduction: The purpose of the study presented here was to determine how well tactile graphics (specifically, data analysis graphs) in secondary mathematics and science braille textbooks correlated with the print graphics.

Method: A content analysis was conducted on 598 separate data analysis graphics from 10 mathematics and science textbooks. The researchers (the authors) cross-validated the findings through a comparative analysis of the tactile graphics of five shared textbooks.

Results: Discrepancies were found between the print graphic and the tactile graphic in 12.5% of the sample. The most common discrepancy was differences in how data lines and data points were individualized in the print graphic compared to the tactile graphic. On the basis of the reviews of the graphics, the researchers answered a 5-point Likert-scale question (from 1 = strongly disagree to 5 = strongly agree) asking whether the "tactile graphic is a valid representation of the print graphic." The overall score for the sample was 3.71 (SD = 1.60), with a Krippendorff alpha of 0.6328 (a measure of agreement; values of alpha > 0.70 are considered moderate).

Discussion: The findings demonstrate that while the majority of tactile graphics correlate well with their print counterparts, there is still room for improvement. Some transcribers omitted a tactile graphic without providing a reason. Forty graphics (6.7%) were omitted from the braille transcription, and two textbooks were missing more than 85% of the tactile graphics of the data graphs.

Implications for practitioners: Tactile graphics in mathematics and science books are important for a student to understand. Although most transcribers do an excellent job of creating valid tactile graphics, problems with many graphics still exist in textbooks. Practitioners need to constantly review the tactile graphics used in all classrooms and be prepared to create their own if needed.

