A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

2019 ◽  
Vol 9 (6) ◽  
pp. 1114 ◽  
Author(s):  
Yun Li ◽  
Yongyao Jiang ◽  
Juan Gu ◽  
Mingyue Lu ◽  
Manzhu Yu ◽  
...  

The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. While research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
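Sessionization, the first step the abstract describes, can be sketched in plain Python. The 30-minute timeout and the (user, timestamp) record layout are illustrative assumptions, not details from the paper, whose framework runs this step in parallel on Apache Spark:

```python
from datetime import datetime, timedelta

# Common heuristic: a gap longer than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)

def reconstruct_sessions(entries):
    """Group (user, timestamp) log entries into per-user sessions.

    A new session starts whenever the gap between consecutive requests
    from the same user exceeds SESSION_TIMEOUT.
    """
    sessions = {}
    for user, ts in sorted(entries, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= SESSION_TIMEOUT:
            user_sessions[-1].append(ts)   # continue the current session
        else:
            user_sessions.append([ts])     # open a new session
    return sessions

logs = [
    ("alice", datetime(2019, 6, 1, 10, 0)),
    ("alice", datetime(2019, 6, 1, 10, 10)),
    ("alice", datetime(2019, 6, 1, 12, 0)),   # >30 min gap: new session
    ("bob",   datetime(2019, 6, 1, 10, 5)),
]
sessions = reconstruct_sessions(logs)
print(len(sessions["alice"]))  # 2
```

In a Spark setting the same per-user grouping would be expressed as a key-by-user shuffle followed by a per-key fold, which is exactly where a skew-aware partitioner such as the paper's logPartitioner matters: a few heavy users can otherwise dominate one partition.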

Universe ◽  
2021 ◽  
Vol 7 (7) ◽  
pp. 220
Author(s):  
Emil Khalikov

The intrinsic spectra of some distant blazars known as “extreme TeV blazars” have shown hints of anomalous hardening in the TeV energy region. Several extragalactic propagation models have been proposed to explain this possible excess transparency of the Universe to gamma-rays, starting from a model that assumes the existence of so-called axion-like particles (ALPs) and the new process of gamma-ALP oscillations. Alternative models suppose that some of the observable gamma-rays are produced in intergalactic cascades. This work focuses on investigating the spectral and angular features of one of the cascade models, the Intergalactic Hadronic Cascade Model (IHCM), in contemporary astrophysical models of the Extragalactic Magnetic Field (EGMF). For the IHCM, the EGMF largely determines the deflection of primary cosmic rays and electrons of intergalactic cascades and is, thus, of vital importance. Contemporary Hackstein models are considered in this paper and compared to the model of Dolag. The models assumed are based on simulations of the local part of the large-scale structure of the Universe and differ in the assumptions for the seed field. This work provides spectral energy distributions (SEDs) and angular extensions of two extreme TeV blazars, 1ES 0229+200 and 1ES 0414+009. It is demonstrated that observable SEDs inside a typical point spread function of imaging atmospheric Cherenkov telescopes (IACTs) for the IHCM would exhibit a characteristic high-energy attenuation compared to the ones obtained in hadronic models that do not consider the EGMF, which makes it possible to distinguish among these models. At the same time, the spectra for the IHCM would have longer high-energy tails than some available spectra for the ALP models and the universal spectra for the Electromagnetic Cascade Model (ECM). The analysis of the IHCM observable angular extensions shows that the sources would likely be identified by most IACTs not as point sources but rather as extended ones. These spectra could later be compared with future observation data of such instruments as the Cherenkov Telescope Array (CTA) and LHAASO.


Author(s):  
Ruiyang Song ◽  
Kuang Xu

We propose and analyze a temporal concatenation heuristic for solving large-scale finite-horizon Markov decision processes (MDPs), which divides the MDP into smaller sub-problems along the time horizon and generates an overall solution by simply concatenating the optimal solutions from these sub-problems. As a “black box” architecture, temporal concatenation works with a wide range of existing MDP algorithms. Our main results characterize the regret of temporal concatenation compared to the optimal solution. We provide upper bounds for general MDP instances, as well as a family of MDP instances in which the upper bounds are shown to be tight. Together, our results demonstrate temporal concatenation's potential for substantial speed-up at the expense of some performance degradation.
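The heuristic can be sketched on a small tabular MDP. The NumPy implementation below is an illustrative reconstruction from the abstract's description, not the authors' code; since the dynamics and rewards here are identical at every stage, each sub-problem is the same, so one sub-policy is solved and repeated:

```python
import numpy as np

def backward_induction(P, R, H, V_terminal):
    """Solve a finite-horizon MDP exactly by backward induction.

    P: (A, S, S) transition tensor, R: (A, S) stage rewards, H: horizon.
    Returns (policy, V0) where policy[t] is the decision rule at stage t.
    """
    V = V_terminal.copy()
    policy = []
    for _ in range(H):
        Q = R + P @ V                 # (A, S): one-step lookahead values
        policy.append(Q.argmax(axis=0))
        V = Q.max(axis=0)
    policy.reverse()                  # built backward in time; reorder
    return policy, V

def temporal_concatenation(P, R, H, k=2):
    """Split the horizon into k equal pieces, solve each sub-MDP with a
    zero terminal value, and concatenate the optimal sub-policies."""
    S = R.shape[1]
    sub_policy, _ = backward_induction(P, R, H // k, np.zeros(S))
    return sub_policy * k             # stationary data: sub-policies repeat

rng = np.random.default_rng(1)
A, S, H = 3, 4, 10
P = rng.dirichlet(np.ones(S), size=(A, S))   # row-stochastic transitions
R = rng.uniform(0.0, 1.0, (A, S))

opt_policy, opt_V = backward_induction(P, R, H, np.zeros(S))
tc_policy = temporal_concatenation(P, R, H, k=2)
print(len(tc_policy))  # 10: a full-horizon policy
```

The regret comes from the zero terminal value inside each piece: near a sub-horizon boundary the sub-policy ignores rewards beyond it, which is precisely the loss the paper's upper bounds quantify.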


Land ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 792
Author(s):  
Shukun Wang ◽  
Dengwang Li ◽  
Tingting Li ◽  
Changquan Liu

Land fragmentation (LF) is widespread worldwide and affects farmers’ decision-making and, thus, farm performance. We used detailed household survey data at the crop level from ten provinces in China to construct four LF indicators and six farm performance indicators. We ran a set of regression models using ordinary least squares (OLS) to analyse the relationship between LF and farm performance. The results showed that (1) LF increased the input of production material and labour costs; (2) LF reduced farmers’ purchasing of mechanical services and the efficiency of ploughing; and (3) LF may increase technical efficiency (this result, however, was not sufficiently robust, and LF had no effect on yield). Generally speaking, LF was negatively related to farm performance. To improve farm performance, it is recommended that decision-makers speed up land transfer and land consolidation, stabilise land property rights, establish land-transfer intermediary organisations and promote large-scale production.
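The kind of OLS regression described can be sketched on synthetic data. The variables below (plot count as an LF indicator, labour cost as a performance indicator) and the planted coefficient of 12 are hypothetical, chosen only to mimic a fragmentation-raises-costs relationship like result (1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical LF indicator: number of plots per farm (more = fragmented)
plots = rng.integers(1, 15, size=n)
farm_size = rng.uniform(0.2, 3.0, size=n)          # hectares (control)

# Synthetic outcome: labour cost rises with fragmentation (true slope 12)
labour_cost = 100.0 + 12.0 * plots + 20.0 * farm_size + rng.normal(0, 10, n)

# OLS via least squares: columns are intercept, LF indicator, control
X = np.column_stack([np.ones(n), plots, farm_size])
beta, *_ = np.linalg.lstsq(X, labour_cost, rcond=None)
print(beta[1])  # estimated marginal labour cost per extra plot, near 12
```

A study like this one would additionally report standard errors and run the regression separately for each of the six performance indicators, which is how non-robust results such as (3) are flagged.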


2018 ◽  
Vol 7 (12) ◽  
pp. 472 ◽  
Author(s):  
Bo Wan ◽  
Lin Yang ◽  
Shunping Zhou ◽  
Run Wang ◽  
Dezhi Wang ◽  
...  

The road-network matching method is an effective tool for map integration, fusion, and update. Due to the complexity of road networks in the real world, matching methods often contain a series of complicated processes to identify homonymous roads and deal with their intricate relationships. However, traditional road-network matching algorithms, which are mainly central processing unit (CPU)-based approaches, may face performance bottlenecks when handling big data. We developed a particle-swarm optimization (PSO)-based parallel road-network matching method on the graphics processing unit (GPU). Based on the characteristics of the two main stages (similarity computation and matching-relationship identification), data-partition and task-partition strategies were utilized, respectively, to make full use of GPU threads. Experiments were conducted on datasets at 14 different scales. Results indicate that the parallel PSO-based matching algorithm (PSOM) could correctly identify most matching relationships with an average accuracy of 84.44%, which was at the same level as the accuracy of a benchmark, the probability-relaxation-matching (PRM) method. The PSOM approach significantly reduced the road-network matching time in dealing with large amounts of data in comparison with the PRM method. This paper provides a common parallel algorithm framework for road-network matching algorithms and contributes to the integration and updating of large-scale road networks.
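A minimal serial PSO kernel (global-best variant) illustrates the optimizer underlying PSOM. This is a generic sketch on a toy objective, not the paper's GPU-parallel, road-matching-specific implementation; the inertia and acceleration weights are common textbook defaults:

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, seed=0):
    """Minimal global-best particle-swarm optimization (minimization)."""
    rng = np.random.default_rng(seed)
    w, c1, c2 = 0.7, 1.5, 1.5            # inertia, cognitive, social weights
    x = rng.uniform(-5.0, 5.0, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()                      # each particle's best position
    pbest_f = np.array([objective(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()    # swarm-wide best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        f = np.array([objective(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[pbest_f.argmin()].copy()
    return g, pbest_f.min()

# Toy objective: the sphere function, minimized at the origin.
best, fmin = pso(lambda p: float(np.sum(p**2)), dim=3)
```

The per-particle fitness evaluations in the inner loop are independent, which is what makes the similarity-computation stage a natural fit for a data-partition strategy across GPU threads.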


2011 ◽  
Vol 12 (1) ◽  
pp. 27-44 ◽  
Author(s):  
Michael Kunz

Simulations of orographic precipitation over the low mountain ranges of southwestern Germany and eastern France with two different physics-based linear precipitation models are presented. Both models are based on 3D airflow dynamics from linear theory and consider advection of condensed water and leeside drying. Sensitivity studies for idealized conditions and a real case study show that the amount and spatial distribution of orographic precipitation are strongly controlled by characteristic time scales for cloud and hydrometeor advection and by background precipitation due to large-scale lifting. These parameters are estimated by adjusting the model results on a 2.5-km grid to observed precipitation patterns for a sample of 40 representative orography-dominated stratiform events (24 h) during a calibration period (1971–80). In general, the best results in terms of lowest RMSE and bias are obtained for characteristic time scales of 1600 s and background precipitation of 0.4 mm h⁻¹. Model simulations of a sample of 84 events during an application period (1981–2000) with fixed parameters demonstrate that both models are able to quantitatively reproduce precipitation patterns obtained from observations and reanalyses from a numerical model [Consortium for Small-scale Modeling (COSMO)]. Combining model results with observation data shows that heavy precipitation over mountains is restricted to situations with strong atmospheric forcing in terms of synoptic-scale lifting, horizontal wind speed, and moisture content.


2018 ◽  
Vol 8 (1) ◽  
pp. 18
Author(s):  
Kees Bourgonje ◽  
Hubert J. Veringa ◽  
David M.J. Smeulders ◽  
Jeroen A. van Oijen

To speed up the torrefaction process in traditional torrefaction reactors, in particular auger reactors, the temperature of the reactor is kept substantially higher than the required torrefaction process temperature. This is due to the low heat conductivity of biomass. Unfortunately, the off-gas characteristics of biomass are very sensitive in the temperature window of 180–300 °C, which can cause a thermal runaway situation in which the process temperature exceeds the intended level. Due to this very sensitive temperature dependence of biomass pyrolysis and its accompanying gas production, a potential solution is to inject small amounts of air directly into the torrefaction reactor. It is found experimentally that this air injection can regulate the temperature of the biomass very rapidly compared to traditional temperature regulation by changing the reactor wall temperature. With this new torrefaction temperature control method, thermal runaway situations can be avoided and the temperature of the biomass in the reactor can be regulated better. Experiments with large beech wood samples show that the torrefaction reaction rate and the temperature in the core of the sample depend on the amount of injected air. Since the flow of combustible gases (torr-gas) originating from the torrefaction process is very sensitive to temperature, the heat production from combusting the torr-gas can be controlled to some extent. This will result in both a more homogeneous torrefied product and more stable processing of varying biomass types in large-scale torrefaction systems.


2018 ◽  
Vol 16 (06) ◽  
pp. 1850052
Author(s):  
Y. H. Lee ◽  
M. Khalil-Hani ◽  
M. N. Marsono

While physical realization of practical large-scale quantum computers is still ongoing, theoretical research of quantum computing applications is facilitated on classical computing platforms through simulation and emulation methods. Nevertheless, the exponential increase in resource requirements with the number of qubits is an inherent issue in classical modeling of quantum systems. In an effort to alleviate the critical scalability issue in existing FPGA emulation works, a novel FPGA-based quantum circuit emulation framework based on the Heisenberg representation is proposed in this paper. Unlike previous works that are restricted to emulations of quantum circuits of small qubit sizes, the proposed FPGA emulation framework can scale up to 120 qubits on an Altera Stratix IV FPGA for the stabilizer circuit case study while providing notable speed-up over the equivalent simulation model.
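The scalability gain comes from the Heisenberg (stabilizer) representation: instead of 2^n state amplitudes, a Clifford circuit on n qubits is tracked as n Pauli generators, each updated in linear time per gate. The sketch below is a minimal software tableau following the standard binary-symplectic update rules, not the paper's FPGA design:

```python
import numpy as np

class Stabilizer:
    """Track stabilizer generators (Heisenberg picture) as a binary
    tableau: one row per generator, x/z bits per qubit, plus a sign bit."""

    def __init__(self, n):
        self.n = n
        self.x = np.zeros((n, n), dtype=bool)
        self.z = np.eye(n, dtype=bool)        # |0...0>: generators Z_i
        self.sign = np.zeros(n, dtype=bool)

    def h(self, q):
        # Hadamard swaps X and Z on qubit q; Y picks up a sign.
        self.sign ^= self.x[:, q] & self.z[:, q]
        self.x[:, q], self.z[:, q] = self.z[:, q].copy(), self.x[:, q].copy()

    def cnot(self, c, t):
        # CNOT maps X_c -> X_c X_t and Z_t -> Z_c Z_t.
        self.sign ^= self.x[:, c] & self.z[:, t] & ~(self.x[:, t] ^ self.z[:, c])
        self.x[:, t] ^= self.x[:, c]
        self.z[:, c] ^= self.z[:, t]

    def paulis(self):
        """Render each generator as a signed Pauli string."""
        out = []
        for i in range(self.n):
            s = "".join("IXZY"[2 * self.z[i, q] + self.x[i, q]]
                        for q in range(self.n))
            out.append(("-" if self.sign[i] else "+") + s)
        return out

s = Stabilizer(2)
s.h(0)
s.cnot(0, 1)      # Bell-state preparation circuit
print(s.paulis())  # ['+XX', '+ZZ']
```

The same row-wise bit operations map naturally onto FPGA logic, since every gate touches only one or two columns of the tableau regardless of n, which is what lets an emulator of this kind scale to over a hundred qubits.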

