Unsupervised Large‐Scale Search for Similar Earthquake Signals

2019 ◽  
Vol 109 (4) ◽  
pp. 1451-1468 ◽  
Author(s):  
Clara E. Yoon ◽  
Karianne J. Bergen ◽  
Kexin Rong ◽  
Hashem Elezabi ◽  
William L. Ellsworth ◽  
...  

Abstract Seismology has continuously recorded ground‐motion spanning up to decades. Blind, uninformed search for similar‐signal waveforms within this continuous data can detect small earthquakes missing from earthquake catalogs, yet doing so with naive approaches is computationally infeasible. We present results from an improved version of the Fingerprint And Similarity Thresholding (FAST) algorithm, an unsupervised data‐mining approach to earthquake detection, now available as open‐source software. We use FAST to search for small earthquakes in 6–11 yr of continuous data from 27 channels over an 11‐station local seismic network near the Diablo Canyon nuclear power plant in central California. FAST detected 4554 earthquakes in this data set, with a 7.5% false detection rate: 4134 of the detected events were previously cataloged earthquakes located across California, and 420 were new local earthquake detections with magnitudes −0.3≤ML≤2.4, of which 224 events were located near the seismic network. Although seismicity rates are low, this study confirms that nearby faults are active. This example shows how seismology can leverage recent advances in data‐mining algorithms, along with improved computing power, to extract useful additional earthquake information from long‐duration continuous data sets.
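To see why a naive search scales poorly, the sketch below compares every window of a signal against every other window using normalized correlation, so the work grows quadratically with the number of windows. This is only an illustration of the brute-force baseline, not the FAST algorithm; the function names, the non-overlapping windowing, and the toy signal are all assumptions made for the example.

```python
import math

def normalized_correlation(a, b):
    """Zero-lag normalized correlation of two equal-length windows."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom

def naive_similar_pairs(signal, window, threshold):
    """Brute-force all-pairs comparison over non-overlapping windows (hypothetical helper)."""
    starts = range(0, len(signal) - window + 1, window)
    windows = [(s, signal[s:s + window]) for s in starts]
    pairs = []
    # The nested loop is the quadratic cost that makes this approach
    # impractical for years of continuous data.
    for i in range(len(windows)):
        for j in range(i + 1, len(windows)):
            if normalized_correlation(windows[i][1], windows[j][1]) >= threshold:
                pairs.append((windows[i][0], windows[j][0]))
    return pairs

# Toy signal: a pulse, a quiet stretch, then the same pulse at twice the amplitude.
pulse = [0.0, 1.0, -1.0, 0.5]
signal = pulse + [0.1, 0.0, 0.0, 0.1] + [2 * x for x in pulse]
pairs = naive_similar_pairs(signal, window=4, threshold=0.9)
print(pairs)
```

With n windows the loop performs n(n-1)/2 comparisons, which is what motivates a more scalable search strategy for multi-year records.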

2020 ◽  
Author(s):  
Isha Sood ◽  
Varsha Sharma

Essentially, data mining concerns the processing of data and the identification of patterns and trends in the information so that decisions and judgments can be made. Data mining concepts have been in use for years, but with the emergence of big data, they have become even more common. In particular, the scalable mining of such large data sets is a difficult problem that has attracted several recent studies. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
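The combination of MapReduce and sampling mentioned above can be illustrated with a small single-process sketch: draw a random sample of the records, count items in each chunk (the map step), and merge the partial counts (the reduce step). This is not the system described in the article; the function names, the chunking scheme, and the toy transactions are assumptions chosen for illustration.

```python
import random
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Map step: count how many transactions in this chunk contain each item."""
    counts = Counter()
    for transaction in chunk:
        counts.update(set(transaction))
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

def sampled_item_counts(transactions, fraction, n_chunks=2, seed=0):
    """Sample a fraction of the transactions, then run map and reduce on the sample."""
    rng = random.Random(seed)
    sample_size = max(1, round(len(transactions) * fraction))
    sample = rng.sample(transactions, sample_size)
    # Split the sample into chunks, as if each went to a separate worker.
    chunks = [sample[i::n_chunks] for i in range(n_chunks)]
    partials = map(map_chunk, chunks)
    return reduce(reduce_counts, partials, Counter()), sample_size

transactions = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
counts, sample_size = sampled_item_counts(transactions, fraction=1.0)
print(counts, sample_size)
```

With `fraction=1.0` the result equals the exact counts; a smaller fraction trades accuracy for less work, and the sampled counts would need to be scaled by `len(transactions) / sample_size` to estimate totals.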


2017 ◽  
Vol 3 (2) ◽  
pp. 5-8
Author(s):  
Линь Ганхуа ◽  
Lin Ganghua ◽  
Ван Сяо-Фань ◽  
Wang Xiao Fan ◽  
Ян Сяо ◽  
...  

This article introduces our ongoing project “Construction of a Century Solar Chromosphere Data Set for Solar Activity Related Research”. Solar activities are the major sources of space weather that affects human lives. Serious space weather consequences include, for instance, interruption of space communication and navigation, compromised safety of astronauts and satellites, and damage to power grids. Therefore, solar activity research has both scientific and social impacts. The major database is built from digitized and standardized film data obtained by several observatories around the world and covers a timespan of more than 100 years. After careful calibration, we will develop feature extraction and data mining tools and provide them together with the comprehensive database to the astronomical community. Our final goal is to address several physical issues: filament behavior in solar cycles, abnormal behavior of solar cycle 24, large-scale solar eruptions, and sympathetic remote brightenings. Significant progress is expected in data mining algorithms and software development, which will benefit the scientific analysis and eventually advance our understanding of solar cycles.


Author(s):  
Prasanna M. Rathod ◽  
Prof. Dr. Anjali B. Raut

Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: (1) CASE, exploiting the programming CASE construct; (2) SPJ, based on standard relational algebra operators (SPJ queries); and (3) PIVOT, using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not. For query optimization, the distance computation and nearest-cluster assignment in k-means are based on SQL. Workload balancing is the assignment of work to processors in a way that maximizes application performance. The process of load balancing can be generalized into four basic steps: (1) monitoring processor load and state; (2) exchanging workload and state information between processors; (3) decision making; and (4) data migration. The decision phase is triggered when a load imbalance is detected, in order to calculate the optimal data redistribution. In the fourth and last phase, data migrates from overloaded processors to under-loaded ones.
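As a rough illustration of the CASE approach, the sketch below generates a query that returns one aggregated column per value of a pivot column, using Python's built-in `sqlite3` module. It is not the authors' code generator; the helper name `horizontal_case_sql`, the `sales` table, and its columns are hypothetical, and the string interpolation assumes trusted identifiers and values.

```python
import sqlite3

def horizontal_case_sql(table, group_col, pivot_col, value_col, pivot_values):
    """Build a SELECT with one SUM(CASE ...) column per pivot value (hypothetical helper)."""
    # Identifiers and values are interpolated directly, so this is only
    # safe for trusted input in a sketch like this one.
    columns = ", ".join(
        f"SUM(CASE WHEN {pivot_col} = '{v}' THEN {value_col} ELSE 0 END) "
        f"AS {value_col}_{v}"
        for v in pivot_values
    )
    return (
        f"SELECT {group_col}, {columns} FROM {table} "
        f"GROUP BY {group_col} ORDER BY {group_col}"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A", "Q1", 10), ("A", "Q2", 20), ("A", "Q1", 5), ("B", "Q2", 7)],
)

query = horizontal_case_sql("sales", "store", "quarter", "amount", ["Q1", "Q2"])
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

Each output row holds one store with its per-quarter totals side by side, which is the horizontal layout the abstract describes; a vertical `GROUP BY store, quarter` would instead return one row per store and quarter.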


2013 ◽  
Vol 6 (1) ◽  
pp. 237-241
Author(s):  
Shaina Dhingra ◽  
Rimple Gilhotra ◽  
Ravishanker Ravishanker

With the increasing demand for IT and the subsequent growth of this sector, high-dimensional data came into existence. Data mining plays an important role in analyzing and extracting useful information. The key information extracted from a huge pool of data is useful for decision makers. Clustering, one of the techniques of data mining, is among the most widely used methods of analyzing data. In this paper, the Kohonen SOM, K-Means, and HAC approaches are discussed. These three methods are then used to analyze the academic data set of the faculty members of a particular university. Finally, a comparative analysis of these algorithms is carried out against parameters such as number of clusters, error rate, and accessing rate. This work will present new and improved results from large-scale datasets.


2011 ◽  
Vol 16 (1) ◽  
pp. 273-285 ◽  
Author(s):  
Gintautas Dzemyda ◽  
Virginijus Marcinkevičius ◽  
Viktor Medvedev

In this paper, we present an approach to a web application (as a service) for data mining oriented toward multidimensional data visualization. This paper focuses on visualization methods as a tool for the visual presentation of large-scale multidimensional data sets. The proposed implementation of such a web application takes a multidimensional data set as input and produces a visualization of this data set. It also supports different configuration parameters of the data mining methods used. Parallel computation is used in the proposed implementation to run the algorithms simultaneously on different computers.


2020 ◽  
Vol 12 (3) ◽  
pp. 81-89
Author(s):  
T. Sanlı ◽  
Ç. Sıcakyüz ◽  
O.H. Yüregir

Data mining, which has different uses such as text mining and web mining, is especially used for clustering and classification purposes. In this study, it was used for both classification and text mining. The aim of the study was to assess the performance of data mining algorithms on three datasets. A total of 6631 master's and doctoral dissertations written in the field of industrial engineering were downloaded from the Higher Education Council database. Using the summaries, subject titles, and keywords of these dissertations, an attempt was made to predict, with the WEKA program, which sub-field of industrial engineering each dissertation belongs to. As a result, it was observed that the data set containing the keywords obtained by weighting the expert opinion was more successful than the other two data sets. The three most successful classification algorithms were found to be kNN, SMO, and J48, respectively. Keywords: Classification Algorithms, Data Mining, Multiple Classes, Dataset.


2016 ◽  
Vol 15 (6) ◽  
pp. 6806-6813 ◽  
Author(s):  
Sethunya R Joseph ◽  
Hlomani Hlomani ◽  
Keletso Letsholo

Research on data mining has yielded numerous tools, algorithms, methods, and approaches for handling large amounts of data for various purposes and for problem solving. Data mining has become an integral part of many application domains such as data warehousing, predictive analytics, business intelligence, bioinformatics, and decision support systems. The prime objective of data mining is to effectively handle large-scale data, extract actionable patterns, and gain insightful knowledge. Data mining is part and parcel of the knowledge discovery in databases (KDD) process. Success and improved decision making normally depend on how quickly one can discover insights from data. These insights can be used to drive better actions in operational processes and even to predict future behaviour. This paper presents an overview of various algorithms necessary for handling large data sets. These algorithms define various structures and methods implemented to handle big data. The review also discusses the general strengths and limitations of these algorithms. This paper can serve as a quick guide or an eye opener for data mining researchers on which algorithm(s) to select and apply in solving the problems they will be investigating.


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\rm o},\delta=47^{\rm o})$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\rm o},\delta=61^{\rm o})$.


2015 ◽  
Vol 8 (1) ◽  
pp. 421-434 ◽  
Author(s):  
M. P. Jensen ◽  
T. Toto ◽  
D. Troyan ◽  
P. E. Ciesielski ◽  
D. Holdridge ◽  
...  

Abstract. The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011 centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array, the data processing activities, including quality checks and humidity bias corrections, and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases and assumptions regarding the characteristics of the surface convective parcel result in significant differences in the derived values of convective levels and indices in many soundings. In addition, the impact of including the humidity corrections and quality controls on the thermodynamic profiles that are used in the derivation of a large-scale model forcing data set is investigated. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.

