Big Data Processing on Volunteer Computing

2021 ◽  
Vol 21 (4) ◽  
pp. 1-20
Author(s):  
Zhihan Lv ◽  
Dongliang Chen ◽  
Amit Kumar Singh

To compute the node-level big data contained in complex networks and realize efficient computation on such networks, a loosely coupled distributed framework, DCBV, is proposed, built on volunteer computing and using ICE middleware as the communication medium. The Master, Worker, and MiddleWare layers of the framework and its development structure are then designed, and the task allocation and recovery strategy, message-passing and communication mode, and fault-tolerance handling are discussed. Finally, to compute and verify parameters such as the average shortest path within the framework and to shorten computation time, an improved exact shortest-path algorithm, N-SPFA, is proposed. The node computation and performance of N-SPFA are explored on different datasets, and the algorithm is compared with four approximate shortest-path algorithms: Combined Link and Attribute (CLA), Lexicographic Breadth-First Search (LBFS), the approximate shortest-path-length algorithm based on center distance of area division (CDZ), and Hub Vertex of Area and Core Expressway (HEA-CE). The results show that when the number of CPU threads is 4, the computation time of the DCBV framework is shortest (514.63 ms). As the number of CPU cores increases, the overall computation time of the framework decreases gradually; for every 2 additional CPU cores, the number of tasks increases by 1. When the number of Worker nodes is 8 and the number of nodes is 1, the computation time of the framework is shortest (210,979 ms), and the I/O statistics increase with the number of Worker nodes. On the Undirected01 and Undirected02 datasets, the computation time of N-SPFA is shortest, at 4520 ms and 7324 ms, respectively. On the ca-condmat_undirected dataset, however, the computation time is 175,292 ms and performance is slightly worse.
Overall, the performance of both N-SPFA and SPFA is good, so the two algorithms are combined: for less complex networks, the computational scale coefficient of SPFA can be set to 0.06, and for general networks, to 0.2. Compared with the other algorithms across datasets, N-SPFA achieves the shortest preprocessing time, average query time, and overall query time, at 49.67 ms, 5.12 ms, and 94,720 ms, respectively, along with the best accuracy (1.0087) and error rate (0.024). In conclusion, volunteer computing can be applied to big data processing and offers a useful reference for the distributed analysis of large-scale complex networks.
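
For context, the baseline SPFA that N-SPFA refines is a queue-based variant of Bellman-Ford. A minimal single-machine sketch in plain Python (the paper's implementation is distributed across the DCBV framework, so this is illustrative only):

```python
from collections import deque

def spfa(graph, source):
    """Shortest Path Faster Algorithm: Bellman-Ford with a work queue.
    graph: {node: [(neighbor, weight), ...]}"""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    queue = deque([source])
    in_queue = {v: False for v in graph}
    in_queue[source] = True
    while queue:
        u = queue.popleft()
        in_queue[u] = False
        for v, w in graph[u]:
            # Relax the edge; re-enqueue v only if it is not already queued.
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                if not in_queue[v]:
                    queue.append(v)
                    in_queue[v] = True
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(spfa(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

Only nodes whose distance actually improved re-enter the queue, which is why SPFA often beats plain Bellman-Ford on sparse networks.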

2014 ◽  
Vol 2014 ◽  
pp. 1-6 ◽  
Author(s):  
Guoyong Mao ◽  
Ning Zhang

Computing the average shortest-path length (ASPL) of a large scale-free network requires substantial memory and computation time. Exploiting a structural feature of scale-free networks, we present a simplification algorithm that cuts suspension points and their connected edges, so that the ASPL of the original network can be computed from that of the simplified network. We also present a multilevel simplification algorithm that obtains the ASPL of the original network directly from that of the multiply simplified network. Our experiments show that these algorithms require less memory and time when computing the ASPL of a scale-free network, making it possible to analyze large networks that were previously out of reach due to memory limitations.
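
A "suspension point" here is a degree-one vertex; pruning such vertices and their pendant edges shrinks the network before the expensive all-pairs step. A toy sketch of the two ingredients (BFS-based ASPL on an unweighted graph, plus the pruning pass only; the paper's correction formula for recovering the original ASPL is not reproduced here):

```python
from collections import deque

def bfs_dists(adj, s):
    """Hop distances from s in an unweighted graph {node: set(neighbors)}."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def aspl(adj):
    """Average shortest-path length over all connected ordered pairs."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_dists(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

def cut_suspension_points(adj):
    """Remove degree-1 vertices ('suspension points') and their pendant edges."""
    adj = {u: set(vs) for u, vs in adj.items()}
    for u in [n for n in adj if len(adj[n]) == 1]:
        if u in adj and len(adj[u]) == 1:  # degree may have changed mid-loop
            (v,) = adj[u]
            adj[v].discard(u)
            del adj[u]
    return adj

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(round(aspl(path), 3))                 # 1.667  (i.e. 5/3)
print(sorted(cut_suspension_points(path)))  # ['b', 'c']
```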


2021 ◽  
Author(s):  
Oluvaseun Owojaiye

Advances in technology have brought considerable improvement to processor design, and manufacturers now place multiple processors on a single chip. Today's supercomputers consist of clusters of interconnected nodes that collaborate to solve complex and advanced computational problems. Message Passing Interface (MPI) and Open Multiprocessing (OpenMP) are the most widely used programming models for optimizing sequential code by parallelizing it on the multiprocessor architectures that exist today. In this thesis, we parallelize the non-slicing floorplan algorithm based on Multilevel Floorplanning/placement of large-scale modules using B*-trees (MB*-tree) with MPI and OpenMP on distributed- and shared-memory architectures, respectively. In VLSI (Very Large Scale Integration) design automation, floorplanning is an initial and vital task performed in the early design stage. Experimental results on the MCNC benchmark circuits show that our parallel algorithm produces better results than the corresponding sequential algorithm: we achieved speedups of up to 4x, reducing computation time while maintaining floorplan solution quality. Comparing the two parallel versions, the OpenMP results were slightly better than the corresponding MPI results.


Author(s):  
Hani S. Mahmassani ◽  
Xuesong Zhou ◽  
Chung-Cheng Lu

This paper presents both exact and approximation algorithms for finding extreme efficient time-dependent shortest paths for use in dynamic traffic assignment on networks with variable toll pricing and heterogeneous users with different value-of-time preferences. A parametric least-generalized-cost path algorithm is presented to determine the complete set of extreme efficient time-dependent paths that simultaneously consider travel-time and cost criteria. Exact procedures, however, may not be practical for large networks, so approximation schemes are devised and tested. Based on the concept of ε-efficiency in multiobjective shortest-path problems, a binary-search framework is developed to find a set of extreme efficient paths that minimizes the expected approximation error, using the underlying value-of-time distribution. Both the exact and approximation schemes (along with variants) are tested on three actual traffic networks. The experimental results indicate that the computation time and the size of the solution set are jointly determined by several key parameters, such as the number of time intervals and the number of nodes in the network. The results also suggest that the proposed approximation scheme is computationally efficient for large-scale bi-objective time-dependent shortest-path applications while maintaining satisfactory solution quality.
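
To illustrate the generalized-cost idea only (this is not the paper's parametric algorithm): for a user with value of time alpha, an arc's generalized cost can be taken as alpha * travel_time(t) + toll, and a time-dependent Dijkstra can propagate arrival times alongside costs. The function names and data layout below are illustrative assumptions, and FIFO travel times are assumed:

```python
import heapq

def td_shortest_path(arcs, source, target, t0, alpha):
    """Time-dependent Dijkstra on generalized cost = alpha * time + toll.
    arcs: {u: [(v, travel_time_fn, toll), ...]}, where travel_time_fn(t)
    gives the arc traversal time when entering at time t (FIFO assumed)."""
    best = {source: 0.0}
    heap = [(0.0, t0, source)]
    while heap:
        cost, t, u = heapq.heappop(heap)
        if u == target:
            return cost, t  # (generalized cost, arrival time)
        if cost > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, tt_fn, toll in arcs.get(u, []):
            tt = tt_fn(t)  # travel time depends on entry time t
            new_cost = cost + alpha * tt + toll
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                heapq.heappush(heap, (new_cost, t + tt, v))
    return None

# Tiny example: arc a->b is tolled and slows down after t = 5.
arcs = {"a": [("b", lambda t: 2.0 if t < 5 else 4.0, 1.0)],
        "b": [("c", lambda t: 1.0, 0.0)]}
print(td_shortest_path(arcs, "a", "c", 0.0, alpha=1.0))  # (4.0, 3.0)
```

Sweeping alpha over the value-of-time distribution and collecting the distinct optimal paths is, loosely, what a parametric scheme does.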


2021 ◽  
Vol 9 (1) ◽  
pp. 95-104
Author(s):  
Fubo Shao ◽  
Hui Liu

Abstract In the era of big data, correlation analysis is significant because it can quickly detect correlations between factors, and it has therefore received much attention. Owing to the good generality and equitability of the maximal information coefficient (MIC), MIC is a hotspot in correlation-analysis research. However, if the original approximate algorithm for MIC is applied directly to mining correlations in big data, the computation time is very long. We therefore analyze the theoretical time complexity of the original approximate algorithm in depth and show that it is O(n^2.4) with default parameters. Our experiments show that the large number of candidate partitions of random relationships is what drives the long computation time. This analysis is good preparation for the next step of designing new, faster algorithms.
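
The cost being analyzed comes from searching over grid partitions: for each candidate x-by-y grid, the MIC procedure evaluates the mutual information of the induced 2-D histogram. A toy sketch of that inner computation for a single equal-width grid (actual MIC optimizes over many partitions and normalizes; this shows one grid's raw score only):

```python
from math import log2
from collections import Counter

def grid_mi(xs, ys, bx, by):
    """Mutual information (in bits) of points binned on a bx-by-by grid."""
    def bin_idx(vals, nbins):
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0
        return [min(int((v - lo) / span * nbins), nbins - 1) for v in vals]

    ix, iy = bin_idx(xs, bx), bin_idx(ys, by)
    n = len(xs)
    pxy = Counter(zip(ix, iy))          # joint cell counts
    px, py = Counter(ix), Counter(iy)   # marginal counts
    mi = 0.0
    for (i, j), c in pxy.items():
        p = c / n
        mi += p * log2(p * n * n / (px[i] * py[j]))
    return mi

# A perfect linear relation binned on a 2x2 grid gives MI = 1 bit.
xs = [0, 1, 2, 3]; ys = [0, 1, 2, 3]
print(round(grid_mi(xs, ys, 2, 2), 3))  # 1.0
```

Repeating this for every admissible grid up to the MIC resolution bound is what drives the superlinear runtime the authors analyze.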


2021 ◽  
Author(s):  
Lewis Martin

There is renewed interest in docking campaigns for ligand discovery since the advent of ultra-large-scale virtual libraries. With brute-force search, the scale of these libraries means highly parallelized compute must be used to avoid years-long computations. This paper reports a re-analysis of docking data from an ultra-large docking campaign against the D4 receptor and AmpC beta-lactamase, and demonstrates large reductions in the computation time needed to identify the top-ranked ligands. A search of 'baseline' featurizations shows that logistic regression on Morgan fingerprints with pharmacophoric atom invariants can match the reported performance on the same task using message-passing networks. With this approach, an ultra-large docking campaign could be performed in a matter of weeks using consumer-grade CPUs with RDKit and scikit-learn. All code and figures are available at https://github.com/ljmartin/dockop
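
The surrogate-model workflow can be sketched without any cheminformatics stack: dock a small labeled subset, fit a linear classifier on fingerprint bit vectors, then rank the undocked remainder by model score. Below is a self-contained toy in plain Python, with a hand-rolled logistic regression and random bit vectors standing in for Morgan fingerprints (in practice one would use RDKit and scikit-learn, as the paper does):

```python
import math
import random

def logreg_train(X, y, lr=0.5, epochs=200):
    """Tiny batch-gradient logistic regression (stand-in for scikit-learn)."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # prediction - label
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def score(X, w, b):
    """Rank candidates by the linear decision value (higher = more promising)."""
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]

# Toy data: bit 0 marks a "good docker"; the other 15 bits are noise.
random.seed(0)
actives = [[1] + [random.randint(0, 1) for _ in range(15)] for _ in range(10)]
inactives = [[0] + [random.randint(0, 1) for _ in range(15)] for _ in range(10)]
w, b = logreg_train(actives + inactives, [1] * 10 + [0] * 10)

# Rank two unseen "molecules": one carrying the informative bit, one not.
ranked = score([[1] + [0] * 15, [0] + [0] * 15], w, b)
print(ranked[0] > ranked[1])  # True
```

The real pipeline differs only in scale and featurization: RDKit produces the fingerprints, scikit-learn fits the model, and only the top-ranked fraction of the library is actually docked.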


Big data refers to huge collections of data of very large size, and it assembles many techniques and technologies to uncover the values hidden in such data sets. Big data needs large servers for storage, which are costly and require maintenance. A cloud server can be the key to this problem, as it offers large-scale storage management. However, it is a third-party service, so the concern here is data security. Data can be secured on the cloud server by strong encryption methodologies. Not all data needs high security, so the data must first be classified into sensitive and insensitive data; only sensitive data needs close attention against threats. This paper focuses on the identification of sensitive data within an acceptable computation time.
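
The abstract does not specify how the classification is done; as one hypothetical illustration, a simple rule-based pass can flag records whose fields match known sensitive patterns, so that only flagged records incur the cost of strong encryption:

```python
import re

# Illustrative patterns only; a real deployment needs locale-specific rules.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify(record):
    """Return 'sensitive' if any field matches a sensitive pattern."""
    for field in record.values():
        text = str(field)
        if any(p.search(text) for p in SENSITIVE_PATTERNS.values()):
            return "sensitive"
    return "insensitive"

records = [
    {"name": "item 42", "note": "restock in May"},
    {"name": "J. Doe", "contact": "jdoe@example.com"},
]
print([classify(r) for r in records])  # ['insensitive', 'sensitive']
```

Such a linear scan keeps classification cost proportional to data volume, which is the "acceptable computation time" constraint the paper emphasizes.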

