A Highly Reliable Storage Systems Based on SSD Array for IoE Environment

HooYoung Ahn; Junsu Kim; YoonJoon Lee

doi:10.4018/ijghpc.2017100101

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2017100101 ◽

2017 ◽

Vol 9 (4) ◽

pp. 1-15

Author(s):

HooYoung Ahn ◽

Junsu Kim ◽

YoonJoon Lee

Keyword(s):

Large Scale ◽

Storage Systems ◽

Experimental Results ◽

Massive Data ◽

Typical Problem ◽

New Approach ◽

Internet Of Everything ◽

Large Scale Data ◽

Write Amplification ◽

Scale Data

Devices in the IoE (Internet of Everything) environment generate massive data from various sensors. To store and process this rapidly incoming large-scale data, SSDs are used to improve the performance and reliability of storage systems. However, SSDs suffer from a typical problem called write amplification, which is caused by their out-of-place update behavior. As write amplification increases, it degrades I/O performance and shortens SSD lifetime. This paper presents a new approach to reducing the write amplification of SSD arrays: a new parity update scheme called LPUS. LPUS transforms random parity updates into sequential writes to additional log blocks in SSD arrays by using parity logs and lazy parity updates. The experimental results show that LPUS reduces write amplification by up to 37% and the number of erases by up to 50% with a reasonable amount of log space.
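The general idea of parity logging with lazy parity updates can be illustrated with a small sketch. This is not the authors' LPUS implementation; it assumes a single RAID-5-style stripe with XOR parity, and the names `ParityLoggedStripe`, `write`, `flush`, and `current_parity` are hypothetical. Each small write appends a parity delta to an append-only log (a sequential write) instead of rewriting the parity block in place, and the parity block is rewritten once when the log is flushed.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityLoggedStripe:
    """Toy model of one stripe with XOR parity (illustrative assumption only)."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = reduce(xor_blocks, self.data)
        self.log = []            # append-only list of parity deltas
        self.parity_writes = 0   # number of in-place parity block rewrites

    def write(self, index: int, new_block: bytes) -> None:
        # The parity delta is old_data XOR new_data; it is logged, not applied.
        delta = xor_blocks(self.data[index], new_block)
        self.data[index] = new_block
        self.log.append(delta)

    def flush(self) -> None:
        # Lazy update: fold all logged deltas into the parity block at once.
        if not self.log:
            return
        merged = reduce(xor_blocks, self.log)
        self.parity = xor_blocks(self.parity, merged)
        self.parity_writes += 1
        self.log.clear()

    def current_parity(self) -> bytes:
        # Parity as it would be after a flush, without modifying any state.
        return reduce(xor_blocks, self.log, self.parity)
```

In this toy model, any number of small writes between two flushes costs a single parity block rewrite, which is the effect that a lazy parity scheme aims for; the real trade-off is the log space consumed in the meantime.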


Scalable data analysis in proteomics and metabolomics using BioContainers and workflows engines

10.1101/604413 ◽

2019 ◽

Author(s):

Yasset Perez-Riverol ◽

Pablo Moreno

Keyword(s):

Data Analysis ◽

Large Scale ◽

Data Science ◽

Proteomics Data ◽

Computational Proteomics ◽

New Approach ◽

Large Scale Data ◽

Desktop Application ◽

Key Steps ◽

Scale Data

The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming an increasingly complex and convoluted process involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are designed as single desktop applications, which limits the scalability and reproducibility of the data analysis. In this paper we give an overview of the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis. We discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, we introduce to the proteomics and metabolomics communities a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments: Galaxy and Nextflow.


Obtaining High-Quality Label by Distinguishing between Easy and Hard Items in Crowdsourcing

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/413 ◽

2017 ◽

Cited By ~ 4

Author(s):

Wei Wang ◽

Xiang-Yu Guo ◽

Shao-Yuan Li ◽

Yuan Jiang ◽

Zhi-Hua Zhou

Keyword(s):

Large Scale ◽

Experimental Results ◽

Training Set ◽

High Quality ◽

Quality Label ◽

Large Scale Data ◽

Voluntary Workers ◽

Scale Data

Crowdsourcing systems make it possible to hire voluntary workers to label large-scale data by offering them small monetary payments. Usually, the taskmaster needs to collect high-quality labels, but the quality of labels obtained from the crowd may not satisfy this requirement. In this paper, we study the problem of obtaining high-quality labels from the crowd and present an approach for learning the difficulty of items in crowdsourcing, in which we construct a small training set of items with estimated difficulty and then learn a model to predict the difficulty of future items. With the predicted difficulty, we can distinguish between easy and hard items to obtain high-quality labels. For easy items, the quality of the labels inferred from the crowd can be high enough to satisfy the requirement. For hard items, the crowd cannot provide high-quality labels, so it is better to choose a more knowledgeable crowd or employ specialized workers to label them. The experimental results demonstrate that the proposed approach of learning to distinguish between easy and hard items can significantly improve label quality.


A Data Dictionary Approach To Meeting User Requests For Accounting Information

Review of Business Information Systems (RBIS) ◽

10.19030/rbis.v3i1.5422 ◽

1999 ◽

Vol 3 (1) ◽

pp. 53-60

Author(s):

Kristi Yuthas ◽

Dennis F. Togo

Keyword(s):

Large Scale ◽

Accounting Information ◽

User Involvement ◽

Massive Data ◽

Data Dictionary ◽

Large Scale Data ◽

User Access ◽

Increasing Demand ◽

User Friendly ◽

Scale Data

In this era of massive data accumulation, dynamic development of large-scale databases, and interfaces intended to be user-friendly, there is still an increasing demand on analysts, as actual user access to databases is still not common practice. A data dictionary approach that includes providing users with a list of relevant data items within the database can expedite the analysis of information requirements and the development of user-requested information systems. Furthermore, this approach enhances user involvement and reduces the demands on analysts for systems development projects.


Optimized management of large-scale data sets stored on tertiary storage systems

IEEE Distributed Systems Online ◽

10.1109/mdso.2004.5 ◽

2004 ◽

Vol 5 (5) ◽

pp. 3-5 ◽

Cited By ~ 6

Author(s):

B. Reiner ◽

K. Hahn

Keyword(s):

Large Scale ◽

Storage Systems ◽

Data Sets ◽

Large Scale Data ◽

Tertiary Storage ◽

Scale Data ◽

Large Scale Data Sets


Effect of Replica Placement on the Reliability of Large-Scale Data Storage Systems

2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems ◽

10.1109/mascots.2010.17 ◽

2010 ◽

Cited By ~ 11

Author(s):

Vinodh Venkatesan ◽

Ilias Iliadis ◽

Xiao-Yu Hu ◽

Robert Haas ◽

Christina Fragouli

Keyword(s):

Data Storage ◽

Large Scale ◽

Storage Systems ◽

Replica Placement ◽

Large Scale Data ◽

Scale Data


GlyFN: A Glyph-Aware Fusion Network for Distributed Chinese Event Detection

10.5121/csit.2021.110114 ◽

2021 ◽

Author(s):

Qi Zhai ◽

Zhigang Kan ◽

Linhui Feng ◽

Linbo Qiao ◽

Feng Liu

Keyword(s):

Event Detection ◽

Large Scale ◽

State Of The Art ◽

Language Model ◽

Special Kind ◽

Detection Task ◽

Experimental Results ◽

Large Scale Data ◽

Unstructured Text ◽

Scale Data

Recently, Chinese event detection has attracted more and more attention. As a special kind of hieroglyphics, Chinese glyphs are semantically useful but still unexplored in this task. In this paper, we propose a novel Glyph-Aware Fusion Network, named GlyFN. It introduces glyph information into the pre-trained language model representation. To obtain a better representation, we design a Vector Linear Fusion mechanism to fuse the two. Specifically, it first uses max-pooling to capture salient information and then uses linear operations on vectors to retain unique information. Moreover, for large-scale unstructured text, we distribute the data across different clusters in parallel. Finally, we conduct extensive experiments on ACE2005 and large-scale data. Experimental results show that GlyFN obtains increases of 7.48 (10.18%) and 6.17 (8.7%) in F1-score for trigger identification and classification, respectively, over the state-of-the-art methods. Furthermore, the event detection task for large-scale unstructured text can be accomplished efficiently through distribution.


Massive Data Management and Sharing Module for Connectome Reconstruction

Brain Sciences ◽

10.3390/brainsci10050314 ◽

2020 ◽

Vol 10 (5) ◽

pp. 314

Author(s):

Jingbin Yuan ◽

Jing Zhang ◽

Lijun Shen ◽

Dandan Zhang ◽

Wenhuan Yu ◽

...

Keyword(s):

Data Management ◽

Data Storage ◽

Large Scale ◽

Rapid Development ◽

Massive Data ◽

Storage And Retrieval ◽

Server Side ◽

Large Scale Data ◽

Client Side ◽

Scale Data

Recently, with the rapid development of electron microscopy (EM) technology and the increasing demand for neuron circuit reconstruction, the scale of reconstruction data has grown significantly. This brings many challenges, one of which is how to manage large-scale data effectively so that researchers can mine valuable information. For this purpose, we developed a data management module consisting of two parts: a storage and retrieval module on the server side and an image cache module on the client side. On the server side, Hadoop and HBase are introduced to handle massive data storage and retrieval. The pyramid model is adopted to store electron microscope images, representing each image at multiple resolutions. A block storage method is proposed to store volume segmentation results. We design a spatial location-based retrieval method for rapidly obtaining images and segments by layer, which achieves constant time complexity. On the client side, a three-level image cache module is designed to reduce latency when acquiring data. Through theoretical analysis and practical tests, our tool shows excellent real-time performance when handling large-scale data. Additionally, the server side can be used as a backend for other similar software or as a public database to manage shared datasets, showing strong scalability.
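One way a location-based lookup over an image pyramid can reach constant time is to compute the storage key arithmetically from the requested position. The sketch below illustrates that idea only; the key layout, the 512-pixel tile size, and the function name `tile_key` are assumptions for illustration and are not taken from the paper.

```python
TILE_SIZE = 512  # assumed tile edge length in pixels (hypothetical)


def tile_key(layer: int, level: int, x: int, y: int,
             tile_size: int = TILE_SIZE) -> str:
    """Build a row key for the tile covering pixel (x, y) of a given layer.

    Level 0 is assumed to be full resolution, and each higher pyramid level
    halves the resolution. The key is computed with a fixed number of
    arithmetic operations, so the cost does not depend on the dataset size.
    """
    scale = 1 << level                 # downsampling factor at this level
    col = (x // scale) // tile_size    # tile column at this level
    row = (y // scale) // tile_size    # tile row at this level
    return f"{layer:06d}_{level:02d}_{row:05d}_{col:05d}"
```

With a key of this shape, a key-value store such as HBase could be queried with a direct row lookup rather than a scan, although the actual key design used by the authors is not described in the abstract.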


An introduction to stochastic bin packing-based server consolidation with conflicts

Top ◽

10.1007/s11750-021-00613-1 ◽

2021 ◽

Author(s):

John Martinovic ◽

Markus Hähnel ◽

Guntram Scheithauer ◽

Waltenegus Dargie

Keyword(s):

Energy Demand ◽

Large Scale ◽

Bin Packing ◽

Packing Problem ◽

Approximate Solutions ◽

Minimal Amount ◽

New Approach ◽

Large Scale Data ◽

Server Clusters ◽

Scale Data

The energy consumption of large-scale data centers or server clusters is expected to grow significantly in the next couple of years, contributing up to 13% of the worldwide energy demand in 2030. As the processing units involved require a disproportionate amount of energy when they are idle, underutilized, or overloaded, balancing the supply of and the demand for computing resources is a key issue in obtaining energy-efficient server consolidations. Whereas traditional concepts mostly consider deterministic predictions of future workloads or only aim at finding approximate solutions, in this article we propose an exact approach to tackle the problem of assigning jobs with (not necessarily independent) stochastic characteristics to a minimal number of servers subject to further practically relevant constraints. As a main contribution, the problem under consideration is reformulated as a stochastic bin packing problem with conflicts and modeled by an integer linear program. Finally, this new approach is tested on real-world instances obtained from a Google data center.
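For orientation, the textbook deterministic bin packing problem with conflicts can be written as the integer linear program below. This is the standard assignment formulation, not the stochastic model proposed in the article; the symbols are assumptions introduced here: items $i \in I$ with weights $w_i$, candidate bins $k \in K$ of capacity $C$, a conflict set $F$ of item pairs that may not share a bin, $y_k = 1$ if bin $k$ is used, and $x_{ik} = 1$ if item $i$ is placed in bin $k$.

```latex
\[
\begin{aligned}
\min \quad & \sum_{k \in K} y_k \\
\text{s.t.} \quad
& \sum_{k \in K} x_{ik} = 1 && \forall i \in I, \\
& \sum_{i \in I} w_i \, x_{ik} \le C \, y_k && \forall k \in K, \\
& x_{ik} + x_{jk} \le y_k && \forall (i,j) \in F,\ \forall k \in K, \\
& x_{ik},\, y_k \in \{0,1\} && \forall i \in I,\ \forall k \in K.
\end{aligned}
\]
```

In a stochastic variant, the deterministic capacity constraint would be replaced by a condition on random job sizes, which is where the article's contribution lies.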


Optimal Energy Management Strategy for Parallel Scheduling

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.568-570.1539 ◽

2014 ◽

Vol 568-570 ◽

pp. 1539-1546

Author(s):

Xin Li Li

Keyword(s):

Data Streams ◽

Power Efficiency ◽

Data Stream ◽

Large Scale ◽

Massive Data ◽

Energy Management Strategy ◽

Parallel Scheduling ◽

Large Scale Data ◽

Scheduling Strategies ◽

Scale Data

Large-scale data stream processing is now fundamental to many data processing applications. There is a growing focus on manipulating large-scale data streams on GPUs in order to improve data throughput. Hence, there is a need to investigate task-level parallel scheduling strategies for large-scale data stream processing and to support them efficiently. We propose two different parallel scheduling strategies to handle massive data streams in real time. Additionally, massive data stream processing on GPUs is an energy-consuming computation task, so we consider power efficiency an important factor in the parallel strategies. We present an approximation method to quantify the power efficiency for massive data streams during the computing phase. Finally, we test and compare the two parallel scheduling strategies on a large quantity of synthetic and real stream data. Both the simulation experiments and the computation results in practice confirm the accuracy of the analysis of performance and power efficiency.


Limitations and Perspectives of Optically Switched Interconnects for Large-scale Data Processing and Storage Systems

MRS Proceedings ◽

10.1557/opl.2012.1409 ◽

2012 ◽

Vol 1438 ◽

Cited By ~ 1

Author(s):

Slavisa Aleksic ◽

Gerhard Schmid ◽

Naida Fehratovic

Keyword(s):

Power Consumption ◽

Data Processing ◽

Large Scale ◽

Storage Systems ◽

Reducing Power ◽

Large Scale Data ◽

Large Scale Data Processing ◽

Processing And Storage ◽

And Storage ◽

Scale Data

The ever-growing Internet data traffic leads to a continuously increasing demand for both capacity and performance in large-scale Information and Communication Technology (ICT) systems such as high-capacity routers and switches, large data centers, and supercomputers. Complex and spatially distributed multirack systems comprising a large number of data processing and storage modules with high-speed interfaces have already become reality. A consequence of this trend is that internal interconnection systems also become large and complex. Interconnection distances, the total required number of cables, and power consumption increase rapidly with capacity, which can limit the scalability of the whole system. This paper addresses the requirements and limitations of intrasystem interconnects for application in large-scale data processing and storage systems. Various point-to-point and optically switched interconnection options are reviewed with regard to their potential to achieve large scalability while reducing power consumption.
