An Event-Driven Serverless ETL Pipeline on AWS

2020 ◽  
Vol 11 (1) ◽  
pp. 191
Author(s):  
Antreas Pogiatzis ◽  
Georgios Samakovitis

This work presents an event-driven Extract, Transform, and Load (ETL) pipeline serverless architecture and evaluates its performance over a range of dataflow tasks of varying frequency, velocity, and payload size. We design an experiment using generated tabular data across varying data volumes, event frequencies, and processing power in order to measure: (i) the consistency of pipeline executions; (ii) the reliability of data delivery; (iii) the maximum payload size per pipeline; and (iv) economic scalability (cost of chargeable tasks). We run 92 parameterised experiments on a simple AWS architecture, deliberately avoiding any AWS-enhanced platform features, to allow an unbiased assessment of our model’s performance. Our results indicate that our reference architecture can achieve time-consistent data processing of event payloads of more than 100 MB, with a throughput of 750 KB/s across four event frequencies. We also observe that, although the utilisation of an SQS queue for data transfer enables easy concurrency control and data slicing, the queue becomes a bottleneck for large event payloads. Finally, we develop and discuss a candidate pricing model for usage of our reference architecture.
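As a rough illustration of the kind of stage such a pipeline chains together, the sketch below shows a Lambda-style transform function consuming SQS messages and writing results to S3; the bucket name, payload layout, and transform step are placeholder assumptions, not details from the paper. Note that SQS caps individual messages at 256 KB, so large payloads must be sliced or passed by reference, which is consistent with the queue bottleneck reported above.

```python
import json
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Hypothetical bucket name for illustration; not taken from the paper.
DEST_BUCKET = "etl-transformed-data"

def handler(event, context):
    """Transform stage: triggered by an SQS event, writes results to S3."""
    for record in event["Records"]:           # one entry per SQS message
        payload = json.loads(record["body"])  # tabular rows serialised as JSON
        transformed = [
            {**row, "amount": float(row["amount"])}  # placeholder transform
            for row in payload["rows"]
        ]
        s3.put_object(
            Bucket=DEST_BUCKET,
            Key=f"batch-{record['messageId']}.json",
            Body=json.dumps(transformed).encode("utf-8"),
        )
    return {"processed": len(event["Records"])}
```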

Biomimetics ◽  
2021 ◽  
Vol 6 (2) ◽  
pp. 32
Author(s):  
Tomasz Blachowicz ◽  
Jacek Grzybowski ◽  
Pawel Steblinski ◽  
Andrea Ehrmann

Computers today use separate components for data storage and data processing, making data transfer between these units a bottleneck for computing speed. So-called cognitive (or neuromorphic) computing approaches therefore try to combine both tasks, as the human brain does, to make computing faster and less energy-consuming. One possible way to prepare new hardware for neuromorphic computing is offered by nanofiber networks, which can be prepared by diverse methods ranging from lithography to electrospinning. Here, we show results of micromagnetic simulations of three coupled semicircle fibers in which domain walls are excited by rotating magnetic fields (inputs), leading to different output signals that can be used for stochastic data processing, mimicking biological synaptic activity and thus being suitable as artificial synapses in artificial neural networks.


2018 ◽  
Vol 8 (11) ◽  
pp. 2216
Author(s):  
Jiahui Jin ◽  
Qi An ◽  
Wei Zhou ◽  
Jiakai Tang ◽  
Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. The problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for that server’s network bandwidth. Existing approaches address this by scheduling computational tasks near the input data, taking into account the server’s free time, data placements, and data transfer costs. However, such approaches usually assign identical values to data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks it runs; as a result, they minimize data-processing time ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although the DynDL scheduling problem is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that, in terms of data-processing time, our algorithms outperform algorithms that do not consider dynamic data transfer costs by 30%. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and can schedule tens of thousands of tasks within subseconds or seconds.
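To make the notion of a dynamic, non-decreasing transfer cost concrete, here is a toy greedy scheduler, not the authors’ DynDL algorithm, in which each additional data-remote task placed on a server raises the price of the next one; the cost function and time units are invented for illustration.

```python
# Toy greedy scheduler: each task runs either on the server holding its input
# data (data-local) or on a remote server, paying a dynamic transfer cost that
# grows with the number of data-remote tasks already assigned to that server.
# Illustrative only; not the DynDL algorithms from the paper.

def transfer_cost(n_remote):
    """Non-decreasing cost function: every extra remote task on a server
    makes its shared network link costlier (constants are assumptions)."""
    return 2.0 + 0.5 * n_remote

def greedy_schedule(tasks, free_time):
    """tasks: list of (task_id, data_server); free_time: server -> queue delay."""
    remote_counts = {s: 0 for s in free_time}
    plan = {}
    for task_id, data_server in tasks:
        # Option 1: run locally, paying only the server's queueing delay.
        local = free_time[data_server]
        # Option 2: run on the cheapest other server, paying a dynamic cost.
        others = [s for s in free_time if s != data_server]
        best = min(others, key=lambda s: free_time[s] + transfer_cost(remote_counts[s]))
        remote = free_time[best] + transfer_cost(remote_counts[best])
        if local <= remote:
            plan[task_id] = data_server
            free_time[data_server] += 1.0   # one unit of compute time
        else:
            plan[task_id] = best
            free_time[best] += 1.0
            remote_counts[best] += 1        # future remote tasks here cost more
    return plan
```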


2021 ◽  
Vol 23 (06) ◽  
pp. 784-793
Author(s):  
Kiran Guruprasad Shetty P S ◽  
◽  
Dr. Ravish Aradhya H V ◽  

Power estimation is a prominent concern for microcontrollers, which aim to be ever more power-efficient. We propose a new method of power estimation based on instruction execution in AURIX, an automotive microcontroller. The main aim of this method is to estimate power at the program (software) or instruction level, since instructions are what the microprocessor continually processes; this is more accurate than previous methodologies. The estimation is based on the set of instructions used in AURIX for data transfer to and from memory, data processing, and data execution across various applications. Most previous methodologies are inaccurate because they operate at higher abstraction levels.
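The idea of instruction-level estimation can be sketched as summing a per-instruction energy over the executed trace and dividing by run time; the energy table, clock rate, and CPI below are illustrative placeholders, not measured AURIX figures.

```python
# Toy instruction-level power estimate: total energy is the sum of a
# per-instruction energy over the executed trace, divided by run time to give
# average power. All numbers are assumed values for illustration.

ENERGY_NJ = {         # nanojoules per instruction class (assumed values)
    "load":   1.8,    # data transfer from memory
    "store":  2.1,    # data transfer to memory
    "alu":    1.0,    # data processing
    "mul":    1.5,
    "branch": 1.2,
}

def estimate_power(trace, clock_hz=300e6, cpi=1.0):
    """trace: list of instruction-class names executed by the program."""
    total_joules = sum(ENERGY_NJ[insn] for insn in trace) * 1e-9
    runtime_s = len(trace) * cpi / clock_hz
    return total_joules / runtime_s   # average power in watts

print(estimate_power(["load", "alu", "alu", "store"] * 1000))  # ~0.44 W
```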


2009 ◽  
Vol 6 (2) ◽  
pp. 23
Author(s):  
Siti Arpah Ahmad ◽  
Mohamed Faidz Mohamed Said ◽  
Norazan Mohamed Ramli ◽  
Mohd Nasir Taib

This paper focuses on the performance of basic communication primitives, namely the overlap of message transfer with computation in point-to-point communication within a small cluster of four nodes. The mpptest benchmark was used to measure the basic performance of MPI message-passing routines with a variety of message sizes. mpptest can measure performance with many participating processes, thus exposing contention and scalability problems, and it enables programmers to select message sizes in order to isolate and evaluate sudden changes in performance. Investigating these matters is interesting because non-blocking calls have the advantage of allowing the system to schedule communications even when many processes are running simultaneously. On the other hand, understanding the characteristics of computation and communication overlap is significant because high-performance kernels often strive to achieve such overlap, which is advantageous both for data transfer and for latency hiding. The results indicate that certain overlap sizes utilize greater node processing power in either blocking or non-blocking send and receive operations. The results provide a detailed MPI characterization of the performance of overlapping message transfer with computation in a small cluster system.
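A minimal sketch of the overlap being measured, using non-blocking MPI calls via mpi4py, is given below; the message size and the dummy computation are assumptions, chosen only to show work proceeding between posting a transfer and waiting on it.

```python
# Overlapping communication with computation using non-blocking MPI calls
# (mpi4py): post the transfer, compute while it is in flight, then wait.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg_size = 1 << 20                    # 1 MiB message, an assumed size
buf = np.ones(msg_size, dtype="u1")
out = np.empty(msg_size, dtype="u1")

if rank == 0:
    req = comm.Isend([buf, MPI.BYTE], dest=1, tag=7)  # returns immediately
    local = np.random.rand(512, 512)
    local = local @ local                             # computation overlaps transfer
    req.Wait()                                        # ensure the send completed
elif rank == 1:
    req = comm.Irecv([out, MPI.BYTE], source=0, tag=7)
    local = np.random.rand(512, 512)
    local = local @ local
    req.Wait()
```

Run with, e.g., `mpiexec -n 2 python overlap.py`; the blocking variant would simply replace `Isend`/`Irecv` plus `Wait` with `Send`/`Recv`, serialising transfer and computation.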


Author(s):  
Kahkashan Tabassum ◽  
Asia Sultana ◽  
Avula Damaodaram

The growing demand for wireless technology and related applications has impelled companies to invest heavily in a wide range of wireless products such as laptops, notebooks, and cellular phones to meet a broad range of customer requirements while maintaining high efficiency and data integrity. Mobile Customers (MC) should be able to access desired information such as news, weather reports, traffic updates, financial information, and stock prices whenever and wherever they wish. However, because they are not physically connected to the servers, they may hold inconsistent data: each client maintains a local cache that stores some of the data sent by the server, and may also prefetch data from the server for caching, based on history, for future use. The cached data must be consistent with the data on the data server in order to serve the user correctly. The critical constraints of mobile devices, namely limited network bandwidth, low battery power, and low processing power, make them all the more susceptible to inconsistencies. Broadcasting is the natural method for disseminating information in shared media: shared Ethernet, optical networks, short-range wireless, and wireless links including satellites. It has the highest priority for disseminating information on the wireless network. Multicasting supports an enormous range of applications within a network and is an effective method to guarantee scalability of bulk data transfer in wireless environments. In a multicast scenario, a single source sends data items, which are then replicated within the network infrastructure to reach a large client population (group). Therefore, it can be used to guarantee scalability, reliable data dissemination, and timely and consistent content distribution.
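A toy sketch of one common consistency mechanism in this setting, a server-broadcast invalidation report that makes clients drop stale cache entries, is given below; the class and method names are invented for illustration and are not taken from the text.

```python
# Toy invalidation-report scheme: the server periodically broadcasts the ids
# of items updated since the last report, and every mobile client drops those
# entries from its local cache, refetching on the next read.

class MobileClient:
    def __init__(self):
        self.cache = {}                      # item_id -> value

    def on_invalidation_report(self, updated_ids):
        for item_id in updated_ids:
            self.cache.pop(item_id, None)    # stale entry; refetch on next use

    def read(self, item_id, server):
        if item_id not in self.cache:        # cache miss: go to the server
            self.cache[item_id] = server.fetch(item_id)
        return self.cache[item_id]

class Server:
    def __init__(self, data):
        self.data = data
        self.dirty = set()

    def update(self, item_id, value):
        self.data[item_id] = value
        self.dirty.add(item_id)

    def broadcast(self, clients):
        for c in clients:                    # one report reaches all clients
            c.on_invalidation_report(self.dirty)
        self.dirty = set()

    def fetch(self, item_id):
        return self.data[item_id]
```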


Author(s):  
Olga Levina ◽  
Vladimir Stantchev

E-Business research and practice can be situated on multiple levels: applications, technological issues, and support and implementation (Ngai and Wat 2002). Here we consider technological components for realizing business processes and discuss their foundation architecture for technological enabling. The article provides an introduction to the terms, techniques, and realization issues of event-driven and service-oriented architectures. We begin with a definition of terms and propose a reference architecture for an event-driven service-oriented architecture (EDSOA). Possible applications in the area of E-Business and solution guidelines are considered in the second part of the article. Service-oriented architectures (SOA) have gained momentum since their introduction in recent years. Seen as an approach to integrating heterogeneous applications within an enterprise architecture, they are also used to design flexible and adaptable business processes. An SOA is designed as a distributed system architecture that offers good integration possibilities for existing application systems. Furthermore, SOA is particularly suitable for complex and large system landscapes.
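As a minimal sketch of the event-driven half of an EDSOA, the snippet below shows services subscribing to topics on an event bus so that publishers stay decoupled from consumers; the topic and service names are illustrative assumptions, and a real EDSOA would dispatch asynchronously over middleware.

```python
# Minimal event bus: services register handlers for event topics, and a
# publisher emits events without knowing which services consume them.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self.subscribers[topic]:
            handler(payload)                 # synchronous here for brevity

bus = EventBus()
bus.subscribe("order.created", lambda o: print("billing service:", o))
bus.subscribe("order.created", lambda o: print("shipping service:", o))
bus.publish("order.created", {"id": 42})
```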


2020 ◽  
Author(s):  
David Schäfer ◽  
Bert Palm ◽  
Lennart Schmidt ◽  
Peter Lünenschloß ◽  
Jan Bumberger

The number of sensors used in the environmental system sciences is increasing rapidly, and while this trend undoubtedly provides great potential to broaden the understanding of complex spatio-temporal processes, it comes with its own set of new challenges. The flow of data from source to sink, from sensors to databases, involves many, usually error-prone, intermediate steps. From data acquisition with its specific scientific and technical challenges, through data transfer from often remote locations, to the final data processing, every step carries great potential to introduce errors and disturbances into the actual environmental signal.

Quantifying these errors becomes a crucial part of the later evaluation of all measured data. While many large environmental observatories are moving from manual to more automated ways of data processing and quality assurance, these systems are usually highly customized and handwritten. This approach is non-ideal in several ways: first, it wastes resources, as the same algorithms are implemented over and over again; second, it imposes great challenges to reproducibility. If the relevant programs are made available at all, they expose all the problems of software reuse: correctness of the implementation, readability and comprehensibility for future users, as well as transferability between different computing environments. Besides these general software-development problems, another crucial factor comes into play: the end product, a processed and quality-controlled data set, is closely tied to the current version of the programs in use. Even small changes to the source code can lead to vastly differing results. If this is not approached responsibly, data and programs will inevitably fall out of sync.

The presented software, the 'System for automated Quality Control' (SaQC) (www.ufz.git.de/rdm-software/saqc), helps to either solve or massively simplify the solution to these challenges. As a mainly no-code platform with a large set of implemented functionality, SaQC lowers the entry barrier for the non-programming scientific practitioner without sacrificing fine-grained adaptation to project-specific needs. Its text-based configuration allows easy integration into version control systems and thus opens the opportunity to use well-established software for data lineage. We will give a short overview of the program's unique features and showcase possibilities to build reliable and reproducible processing and quality assurance pipelines for real-world data from a spatially distributed, heterogeneous sensor network.
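In the spirit of SaQC's text-based, no-code configuration, the sketch below runs quality-control tests named in a small config over a pandas DataFrame; the config syntax and function names are invented for illustration and are NOT SaQC's actual API.

```python
# Toy text-configured quality control: each config line names a variable,
# a test, and its parameters. Syntax and test names are invented.
import pandas as pd

CONFIG = """
temperature ; range ; min=-40 ; max=60
temperature ; spike ; threshold=5
"""

def flag_range(series, flags, min, max):
    return flags | (series < float(min)) | (series > float(max))

def flag_spike(series, flags, threshold):
    return flags | (series.diff().abs() > float(threshold))

TESTS = {"range": flag_range, "spike": flag_spike}

def run(df):
    flags = {col: pd.Series(False, index=df.index) for col in df}
    for line in CONFIG.strip().splitlines():
        var, test, *params = [p.strip() for p in line.split(";")]
        kwargs = dict(p.split("=") for p in params)
        flags[var] = TESTS[test](df[var], flags[var], **kwargs)
    return flags   # boolean mask of suspicious values per variable
```

Because the configuration is plain text, it can live under version control alongside the processing code, which is exactly the data-lineage opportunity described above.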


2015 ◽  
Vol 9s1 ◽  
pp. BBI.S28988 ◽  
Author(s):  
Frank A. Feltus ◽  
Joseph R. Breen ◽  
Juan Deng ◽  
Ryan S. Izard ◽  
Christopher A. Konger ◽  
...  

In the last decade, high-throughput DNA sequencing has become a disruptive technology and pushed the life sciences into a distributed ecosystem of sequence data producers and consumers. Given the power of genomics and declining sequencing costs, biology is an emerging “Big Data” discipline that will soon enter the exabyte range when all subdisciplines are combined. These datasets must be transferred across commercial and research networks in creative ways, since sending data without thought can have serious consequences for data processing time frames. It is therefore imperative that biologists, bioinformaticians, and information technology engineers recalibrate their data processing paradigms to fit this emerging reality. This review attempts to provide a snapshot of Big Data transfer across networks, an issue often overlooked by biologists. Specifically, we discuss four key areas: 1) data transfer networks, protocols, and applications; 2) data transfer security, including encryption, access, firewalls, and the Science DMZ; 3) data flow control with software-defined networking; and 4) data storage, staging, archiving, and access. A primary intention of this article is to orient the biologist in key aspects of the data transfer process in order to frame their genomics-oriented needs to enterprise IT professionals.
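As a back-of-the-envelope illustration of why transfer planning matters at these scales, the snippet below estimates wall-clock transfer times for a 100 TB dataset over nominal link speeds; the 80% efficiency factor is an assumption standing in for protocol overhead and contention.

```python
# Rough transfer-time arithmetic for genomics-scale data. Link speeds are
# nominal; real throughput is usually lower due to protocol overhead.

def transfer_time_hours(size_tb, link_gbps, efficiency=0.8):
    bits = size_tb * 1e12 * 8                    # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

for link in (1, 10, 100):                        # Gbps
    print(f"{link:>3} Gbps: {transfer_time_hours(100, link):7.1f} h for 100 TB")
# 1 Gbps -> ~278 h; 10 Gbps -> ~28 h; 100 Gbps -> ~3 h
```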

