Semiautomated Alignment of High-Throughput Metabolite Profiles with Chemometric Tools

2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Ze-ying Wu ◽  
Zhong-da Zeng ◽  
Zi-dan Xiao ◽  
Daniel Kam-Wah Mok ◽  
Yi-zeng Liang ◽  
...  

The rapid increase in the use of metabolite profiling/fingerprinting techniques to resolve complicated issues in metabolomics has stimulated demand for data processing techniques, such as alignment, to extract detailed information. In this study, a new and automated method was developed to correct the retention time shift of high-dimensional and high-throughput data sets. Information from the target chromatographic profiles was used to determine the standard profile as a reference for alignment. A novel, piecewise data partition strategy was applied for the determination of the target components in the standard profile as markers for alignment. An automated target search (ATS) method was proposed to find the exact retention times of the selected targets in other profiles for alignment. The linear interpolation technique (LIT) was employed to align the profiles prior to pattern recognition, comprehensive comparison analysis, and other data processing steps. In total, 94 metabolite profiles of ginseng were studied, including the most volatile secondary metabolites. The method used in this article could be an essential step in the extraction of information from high-throughput data acquired in the study of systems biology, metabolomics, and biomarker discovery.
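The linear-interpolation alignment step can be sketched in a few lines. The sketch below is illustrative only, under the assumption that marker retention times have already been found in both the standard and sample profiles (the ATS step); the function names, marker positions, and profile values are stand-ins, not the authors' code.

```python
# Minimal sketch of retention-time alignment by linear interpolation (LIT):
# warp a sample profile so that markers found by the target search line up
# with their positions in the standard (reference) profile.

def align_profile(profile, ref_markers, sample_markers):
    """Warp `profile` (intensities sampled at integer retention indices) so
    that the marker at sample_markers[i] moves to ref_markers[i]; values
    between markers are linearly interpolated."""
    n = len(profile)
    aligned = []
    for t in range(n):
        # Map reference index t back to a (fractional) sample index.
        s = _map_index(t, ref_markers, sample_markers, n)
        lo = int(s)
        hi = min(lo + 1, n - 1)
        frac = s - lo
        aligned.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return aligned

def _map_index(t, ref, sam, n):
    # Anchor the piecewise-linear warp at both ends of the profile.
    ref_pts = [0] + list(ref) + [n - 1]
    sam_pts = [0] + list(sam) + [n - 1]
    for i in range(len(ref_pts) - 1):
        if ref_pts[i] <= t <= ref_pts[i + 1]:
            span = (ref_pts[i + 1] - ref_pts[i]) or 1
            frac = (t - ref_pts[i]) / span
            return sam_pts[i] + frac * (sam_pts[i + 1] - sam_pts[i])
    return t

# Example: a peak detected at index 6 in the sample, whose marker sits at
# index 5 in the standard profile, is shifted back onto index 5.
profile = [0] * 11
profile[6] = 10
aligned = align_profile(profile, [5], [6])
```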

2021 ◽  
Vol 14 (S1) ◽  
Author(s):  
Zishuang Zhang ◽  
Zhi-Ping Liu

Abstract Background Hepatocellular carcinoma (HCC) is one of the most common cancers. The discovery of specific genes serving as biomarkers is of paramount significance for cancer diagnosis and prognosis. The high-throughput omics data generated by The Cancer Genome Atlas (TCGA) consortium provides a valuable resource for the discovery of HCC biomarker genes. Numerous methods have been proposed to select cancer biomarkers. However, these methods have not investigated the robustness of identification across different feature selection techniques. Methods We use six different recursive feature elimination methods to select the gene signatures of HCC from TCGA liver cancer data. The genes shared by the six selected subsets are proposed as robust biomarkers. The Akaike information criterion (AIC) is employed to explain the optimization process of feature selection, which provides a statistical interpretation for the feature selection in machine learning methods. We then use several methods to validate the screened biomarkers. Results In this paper, we propose a robust method for discovering biomarker genes for HCC from gene expression data. Specifically, we implement recursive feature elimination with cross-validation (RFE-CV) based on six different classification algorithms. The overlaps in the gene sets discovered via the different methods are referred to as the identified biomarkers. We give an interpretation of the machine-learning feature selection process using AIC from statistics. Furthermore, the features selected by backward logistic stepwise regression via the AIC minimum criterion are completely contained in the identified biomarkers. The classification results verify the superiority of this interpretable, robust biomarker discovery method. Conclusions We find that the overlap among the gene subsets quantitatively selected by the RFE-CV of the six classifiers yields the identified biomarkers.
The AIC values in model selection provide a theoretical foundation for the feature selection process of biomarker discovery via machine learning. Moreover, genes contained in more of the optimally selected subsets make better biological sense and carry stronger implications. The quality of feature selection is improved by intersecting the biomarkers selected by different classifiers. This is a general method suitable for screening biomarkers of complex diseases from high-throughput data.
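The intersection idea behind the pipeline can be illustrated with a toy stand-in: run recursive feature elimination under several different feature-scoring functions (standing in for the six classifiers' RFE-CV rankings) and keep only the genes every run selects. The gene names, scores, and scorers below are fabricated for illustration, not the paper's data or code.

```python
# Toy sketch: robust biomarkers as the intersection of gene subsets
# chosen by recursive feature elimination under different scorers.

def rfe(features, score_fn, n_keep):
    """Recursively drop the lowest-scoring feature until n_keep remain."""
    selected = list(features)
    while len(selected) > n_keep:
        worst = min(selected, key=score_fn)
        selected.remove(worst)
    return set(selected)

def robust_biomarkers(features, scorers, n_keep):
    """Intersect the subsets selected by each scorer (one per classifier)."""
    subsets = [rfe(features, s, n_keep) for s in scorers]
    return set.intersection(*subsets)

# Hypothetical genes and per-"classifier" importance scores.
genes = ["TP53", "AFP", "GPC3", "ALB", "KRT19"]
rank_a = {"TP53": 0.9, "AFP": 0.8, "GPC3": 0.7, "ALB": 0.2, "KRT19": 0.1}
rank_b = {"TP53": 0.8, "AFP": 0.9, "GPC3": 0.1, "ALB": 0.7, "KRT19": 0.3}
rank_c = {"TP53": 0.7, "AFP": 0.6, "GPC3": 0.5, "ALB": 0.4, "KRT19": 0.9}
robust = robust_biomarkers(genes, [rank_a.get, rank_b.get, rank_c.get], n_keep=3)
```

In a real pipeline the scorers would come from cross-validated model weights (e.g. scikit-learn's RFECV over six classifiers) rather than fixed tables; only the genes surviving every ranking are reported.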


BMC Genomics ◽  
2020 ◽  
Vol 21 (S10) ◽  
Author(s):  
Tanveer Ahmad ◽  
Nauman Ahmed ◽  
Zaid Al-Ars ◽  
H. Peter Hofstee

Abstract Background Immense improvements in sequencing technologies enable the production of large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data, which needs to be processed efficiently for further downstream analyses. Computing systems need these large amounts of data closer to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, so processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility, and other physical constraints of DRAM memory, it was not feasible to place large working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data sets in memory and process them directly there, avoiding disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, properly formatted data placement in memory and high-throughput access to it are necessary, avoiding (de)serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in-memory without accessing disk storage, avoiding (de)serialization and copy overheads. Implementation We integrate the Apache Arrow in-memory Sequence Alignment/Map (SAM) format and its shared-memory object-store library into widely used genomics high-throughput data processing applications such as BWA-MEM, Picard, and GATK to allow in-memory communication between these applications. In addition, this also allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects.
Results Our implementation shows that adopting the in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, fewer memory accesses due to high cache-locality exploitation, and parallel scalability through shared memory objects. Our implementation focuses on the GATK Best Practices workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placement and sharing techniques, such as ramDisk and Unix pipes, to show how the columnar in-memory data representation outperforms both. We achieve speedups of 4.85x and 4.76x for WGS and WES data, respectively, in the overall execution time of the variant calling workflows. Similarly, speedups of 1.45x and 1.27x for these data sets, respectively, are achieved compared to the second fastest workflow. In some individual tools, particularly sorting, duplicate removal, and base quality score recalibration, the speedup is even more promising. Availability The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM.
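The core idea of sharing one typed column between tools without copies or (de)serialization can be sketched with Python's standard library alone. This is a simplified stand-in: Apache Arrow generalizes it to full, language-independent columnar tables with a shared object store, whereas the snippet below just places one column of SAM-like data (mapping qualities, here fabricated) in a shared-memory region and reads it back in place.

```python
# Stdlib sketch of zero-copy columnar sharing: a producer writes a column
# of mapping qualities into shared memory; a consumer (normally another
# process, e.g. a sorting or dedup tool) attaches and reads it in place.

from multiprocessing import shared_memory
import struct

mapq = [60, 60, 37, 0, 60]  # one column: mapping quality per read

# Producer: pack the column into a shared-memory block.
shm = shared_memory.SharedMemory(create=True, size=len(mapq) * 4)
struct.pack_into(f"{len(mapq)}i", shm.buf, 0, *mapq)

# Consumer: attach by name and view the column without copying or
# deserializing -- the int values are read directly from the buffer.
shm2 = shared_memory.SharedMemory(name=shm.name)
view = memoryview(shm2.buf).cast("i")
high_quality = sum(1 for q in view[:len(mapq)] if q >= 30)

view.release()
shm2.close()
shm.close()
shm.unlink()
```

Because the consumer scans a contiguous typed buffer, this layout also gives the cache-locality benefit the abstract mentions; Arrow additionally carries the schema so consumers in other languages can interpret the same bytes.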


Author(s):  
Andre Martin ◽  
Thomas Knauth ◽  
Stephan Creutz ◽  
Diogo Becker ◽  
Stefan Weigert ◽  
...  

2020 ◽  
Author(s):  
Jakob Dahl ◽  
Xingzhi Wang ◽  
Xiao Huang ◽  
Emory Chan ◽  
Paul Alivisatos

Advances in automation and data analytics can aid exploration of the complex chemistry of nanoparticles. Lead halide perovskite colloidal nanocrystals provide an interesting proving ground: there are reports of many different phases and transformations, which has made it hard to form a coherent conceptual framework for their controlled formation through traditional methods. In this work, we systematically explore the portion of Cs-Pb-Br synthesis space in which many optically distinguishable species are formed using high-throughput robotic synthesis to understand their formation reactions. We deploy an automated method that allows us to determine the relative amount of absorbance that can be attributed to each species in order to create maps of the synthetic space. These in turn facilitate improved understanding of the interplay between kinetic and thermodynamic factors that underlie which combination of species are likely to be prevalent under a given set of conditions. Based on these maps, we test potential transformation routes between perovskite nanocrystals of different shapes and phases. We find that shape is determined kinetically, but many reactions between different phases show equilibrium behavior. We demonstrate a dynamic equilibrium between complexes, monolayers and nanocrystals of lead bromide, with substantial impact on the reaction outcomes. This allows us to construct a chemical reaction network that qualitatively explains our results as well as previous reports and can serve as a guide for those seeking to prepare a particular composition and shape.
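The attribution step can be illustrated with a minimal linear-unmixing sketch: model each measured absorbance spectrum as a linear combination of known component spectra and solve the least-squares normal equations for the coefficients. The two-component spectra below are fabricated stand-ins, not real Cs-Pb-Br data, and the function is a simplification of whatever fitting the automated pipeline actually performs.

```python
# Sketch of spectral unmixing for two species: find coefficients (ca, cb)
# minimizing ||measured - ca*comp_a - cb*comp_b|| via the 2x2 normal
# equations. Coefficients map to each species' share of the absorbance.

def unmix(measured, comp_a, comp_b):
    aa = sum(a * a for a in comp_a)
    bb = sum(b * b for b in comp_b)
    ab = sum(a * b for a, b in zip(comp_a, comp_b))
    ya = sum(y * a for y, a in zip(measured, comp_a))
    yb = sum(y * b for y, b in zip(measured, comp_b))
    det = aa * bb - ab * ab          # assumes components are not collinear
    ca = (ya * bb - yb * ab) / det
    cb = (yb * aa - ya * ab) / det
    return ca, cb

# Fabricated component spectra and a mixture built as 2*A + 3*B.
comp_a = [1.0, 1.0, 0.0]
comp_b = [0.0, 1.0, 1.0]
measured = [2.0, 5.0, 3.0]
ca, cb = unmix(measured, comp_a, comp_b)
```

With more than two species the same normal-equations approach extends to a small linear solve per spectrum, which is cheap enough to run over every well of a high-throughput synthesis plate.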

