A Highly Reliable Storage Systems Based on SSD Array for IoE Environment

Author(s):  
HooYoung Ahn ◽  
Junsu Kim ◽  
YoonJoon Lee

Devices in an IoE (Internet of Everything) environment generate massive amounts of data from various sensors. To store and process this rapidly incoming large-scale data, SSDs are used to improve the performance and reliability of storage systems. However, SSDs suffer from a typical problem called write amplification, which is caused by their out-of-place update characteristic. As write amplification increases, it degrades I/O performance and shortens the SSDs' lifetime. This paper presents a new approach to reducing the write amplification of SSD arrays: a new parity update scheme called LPUS. LPUS transforms random parity updates into sequential writes to additional log blocks in the SSD array by using parity logs and lazy parity updates. The experimental results show that LPUS reduces write amplification by up to 37% and the number of erases by up to 50% with a reasonable amount of log space.
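The idea of parity logging with lazy parity updates can be illustrated with a minimal sketch on a single XOR-parity stripe. This is not the LPUS implementation: the class name `ParityLogStripe`, its methods, and the one-stripe in-memory model are assumptions made for illustration only. Each data write appends a parity delta to a sequential log, and the parity block is rewritten once when the log is flushed.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityLogStripe:
    """Toy single stripe with XOR parity and an append-only parity log (hypothetical)."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = reduce(xor_bytes, self.data)
        self.log = []            # parity deltas, appended sequentially
        self.parity_writes = 0   # number of in-place parity block rewrites

    def write(self, index, new_block):
        # Log the parity delta (old XOR new) instead of rewriting parity now.
        self.log.append(xor_bytes(self.data[index], new_block))
        self.data[index] = new_block

    def flush(self):
        # Lazy update: fold all logged deltas into parity with one rewrite.
        if self.log:
            self.parity = reduce(xor_bytes, self.log, self.parity)
            self.log.clear()
            self.parity_writes += 1


stripe = ParityLogStripe([b"\x01\x02", b"\x10\x20", b"\x0f\xf0"])
stripe.write(0, b"\xaa\xbb")
stripe.write(2, b"\x00\x00")
stripe.write(0, b"\x11\x22")
stripe.flush()
```

In this toy model, three data writes cause a single parity rewrite, and the flushed parity still equals the XOR of the current data blocks. How large the log may grow before a flush is a tuning choice that the sketch leaves open.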

2019 ◽  
Author(s):  
Yasset Perez-Riverol ◽  
Pablo Moreno

The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming an increasingly complex and convoluted process involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are designed as single desktop applications, which limits the scalability and reproducibility of the data analysis. In this paper we give an overview of the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis. We discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, we introduce to the proteomics and metabolomics communities a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments: Galaxy and Nextflow.


Author(s):  
Wei Wang ◽  
Xiang-Yu Guo ◽  
Shao-Yuan Li ◽  
Yuan Jiang ◽  
Zhi-Hua Zhou

Crowdsourcing systems make it possible to hire voluntary workers to label large-scale data by offering them small monetary payments. Usually, the taskmaster needs to collect high-quality labels, while the quality of labels obtained from the crowd may not satisfy this requirement. In this paper, we study the problem of obtaining high-quality labels from the crowd and present an approach for learning the difficulty of items in crowdsourcing, in which we construct a small training set of items with estimated difficulty and then learn a model to predict the difficulty of future items. With the predicted difficulty, we can distinguish between easy and hard items to obtain high-quality labels. For easy items, the quality of the labels inferred from the crowd can be high enough to satisfy the requirement; for hard items, for which the crowd cannot provide high-quality labels, it is better to choose a more knowledgeable crowd or employ specialized workers to label them. The experimental results demonstrate that the proposed approach of learning to distinguish between easy and hard items can significantly improve label quality.
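The easy/hard routing step can be sketched as follows. Note the swap: the paper learns a model that predicts item difficulty, whereas this sketch uses plain vote disagreement as a stand-in difficulty score. The function names and the `threshold` value are assumptions for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Most common label among the workers' votes (ties: first encountered)."""
    return Counter(labels).most_common(1)[0][0]


def disagreement(labels):
    """Share of votes that differ from the majority label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - top_count / len(labels)


def route_items(votes, threshold=0.3):
    """Split items into crowd-labelled (easy) and expert-routed (hard)."""
    easy, hard = {}, []
    for item, labels in votes.items():
        if disagreement(labels) <= threshold:
            easy[item] = majority_label(labels)   # trust the crowd
        else:
            hard.append(item)                     # send to specialized workers
    return easy, hard


votes = {
    "img1": ["cat", "cat", "cat", "cat", "dog"],
    "img2": ["cat", "dog", "bird", "dog", "cat"],
}
easy, hard = route_items(votes)
```

Here `img1` is labelled by majority vote, while `img2`, on which the workers split, is set aside for more knowledgeable labellers.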


1999 ◽  
Vol 3 (1) ◽  
pp. 53-60
Author(s):  
Kristi Yuthas ◽  
Dennis F. Togo

In this era of massive data accumulation, dynamic development of large-scale databases, and interfaces intended to be user-friendly, the demand on analysts is still increasing because direct user access to databases is still not common practice. A data dictionary approach, which provides users with a list of relevant data items within the database, can expedite the analysis of information requirements and the development of user-requested information systems. Furthermore, this approach enhances user involvement and reduces the demands on analysts in systems development projects.


2021 ◽  
Author(s):  
Qi Zhai ◽  
Zhigang Kan ◽  
Linhui Feng ◽  
Linbo Qiao ◽  
Feng Liu

Recently, Chinese event detection has attracted more and more attention. As a special kind of hieroglyphics, Chinese glyphs are semantically useful but still unexplored in this task. In this paper, we propose a novel Glyph-Aware Fusion Network, named GlyFN. It introduces glyph information into the pre-trained language model representation. To obtain a better representation, we design a Vector Linear Fusion mechanism to fuse the two. Specifically, it first applies max-pooling to capture salient information and then uses linear vector operations to retain unique information. Moreover, for large-scale unstructured text, we distribute the data across different clusters in parallel. Finally, we conduct extensive experiments on ACE2005 and on large-scale data. Experimental results show that GlyFN obtains increases of 7.48 (10.18%) and 6.17 (8.7%) in F1-score for trigger identification and classification, respectively, over the state-of-the-art methods. Furthermore, the event detection task for large-scale unstructured text can be accomplished efficiently through distribution.


2020 ◽  
Vol 10 (5) ◽  
pp. 314
Author(s):  
Jingbin Yuan ◽  
Jing Zhang ◽  
Lijun Shen ◽  
Dandan Zhang ◽  
Wenhuan Yu ◽  
...  

Recently, with the rapid development of electron microscopy (EM) technology and the increasing demand for neuron circuit reconstruction, the scale of reconstruction data has grown significantly. This brings many challenges, one of which is how to manage large-scale data effectively so that researchers can mine valuable information. For this purpose, we developed a data management module with two parts: a storage and retrieval module on the server side and an image cache module on the client side. On the server side, Hadoop and HBase are introduced to handle massive data storage and retrieval. The pyramid model is adopted to store electron microscope images, representing each image at multiple resolutions. A block storage method is proposed to store volume segmentation results. We design a spatial location-based retrieval method for rapidly obtaining images and segments by layer, which achieves constant time complexity. On the client side, a three-level image cache module is designed to reduce latency when acquiring data. Through theoretical analysis and practical tests, our tool shows excellent real-time performance when handling large-scale data. Additionally, the server side can be used as a backend for other similar software or as a public database to manage shared datasets, showing strong scalability.
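One way a location-based lookup over an image pyramid can run in constant time is to compute the tile's storage key directly from its coordinates. The sketch below illustrates that idea only; the key layout, the 256-pixel tile size, the halving of resolution per level, and the function names are all assumptions, and a Python dict stands in for the key-value store.

```python
def tile_key(level: int, z: int, row: int, col: int) -> str:
    """Build a fixed-width key for one tile (hypothetical key layout)."""
    return f"{level:02d}/{z:06d}/{row:05d}/{col:05d}"


def locate_tile(x: int, y: int, z: int, level: int, tile_size: int = 256) -> str:
    """Map full-resolution pixel (x, y) on layer z to its tile key at a pyramid level.

    Level 0 is assumed to be full resolution; each level halves the resolution.
    """
    scale = 1 << level                 # downsampling factor at this level
    col = (x // scale) // tile_size
    row = (y // scale) // tile_size
    return tile_key(level, z, row, col)


# A dict stands in for the key-value store used on the server side.
store = {tile_key(2, 17, 1, 3): b"<tile bytes>"}
key = locate_tile(x=3500, y=1100, z=17, level=2)
tile = store.get(key)
```

Because the key is derived by integer arithmetic and fetched with a single point lookup, the cost does not depend on how many tiles are stored.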


Top ◽  
2021 ◽  
Author(s):  
John Martinovic ◽  
Markus Hähnel ◽  
Guntram Scheithauer ◽  
Waltenegus Dargie

The energy consumption of large-scale data centers or server clusters is expected to grow significantly in the next couple of years, accounting for up to 13% of the worldwide energy demand in 2030. As the processing units involved require a disproportionate amount of energy when they are idle, underutilized, or overloaded, balancing the supply of and the demand for computing resources is a key issue in obtaining energy-efficient server consolidations. Whereas traditional concepts mostly consider deterministic predictions of future workloads or only aim at finding approximate solutions, in this article we propose an exact approach to the problem of assigning jobs with (not necessarily independent) stochastic characteristics to a minimal number of servers, subject to further practically relevant constraints. As a main contribution, the problem under consideration is reformulated as a stochastic bin packing problem with conflicts and modeled by an integer linear program. Finally, this new approach is tested on real-world instances obtained from a Google data center.
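To show what "bin packing with conflicts" means in the server consolidation setting, here is a small sketch of a greedy first-fit-decreasing heuristic. It deliberately differs from the article's method: it assumes deterministic job loads and returns an approximate solution, whereas the article treats stochastic loads and solves an integer linear program exactly. The function name and the example data are assumptions, and every job load is assumed not to exceed the server capacity.

```python
def first_fit_decreasing(loads, capacity, conflicts):
    """Greedily assign jobs to servers; returns one list of job names per server."""
    blocked = {}
    for a, b in conflicts:             # conflicting jobs may not share a server
        blocked.setdefault(a, set()).add(b)
        blocked.setdefault(b, set()).add(a)

    servers = []                       # each server: {"jobs": [...], "used": load}
    for job in sorted(loads, key=loads.get, reverse=True):
        for server in servers:
            fits = server["used"] + loads[job] <= capacity
            clash = blocked.get(job, set()) & set(server["jobs"])
            if fits and not clash:
                server["jobs"].append(job)
                server["used"] += loads[job]
                break
        else:
            # No open server can take the job: start a new one.
            servers.append({"jobs": [job], "used": loads[job]})
    return [s["jobs"] for s in servers]


loads = {"a": 6, "b": 5, "c": 4, "d": 3, "e": 2}
assignment = first_fit_decreasing(loads, capacity=10, conflicts=[("a", "c")])
```

Without the conflict, jobs `a` and `c` would fill one server exactly and two servers would suffice; with it, three servers are needed for this instance.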


2014 ◽  
Vol 568-570 ◽  
pp. 1539-1546
Author(s):  
Xin Li Li

Large-scale data stream processing is now fundamental to many data processing applications. There is a growing focus on manipulating large-scale data streams on GPUs in order to improve data throughput. Hence, there is a need to investigate task-level parallel scheduling strategies for large-scale data stream processing and to support them efficiently. We propose two different parallel scheduling strategies to handle massive data streams in real time. Additionally, processing massive data streams on GPUs is an energy-consuming computation task, so we consider power efficiency an important factor in the parallel strategies. We present an approximation method to quantify the power efficiency for massive data streams during the computing phase. Finally, we test and compare the two parallel scheduling strategies on a large quantity of synthetic and real stream data. Both the simulation experiments and the practical computation results confirm the accuracy of the analysis of performance and power efficiency.


2012 ◽  
Vol 1438 ◽  
Author(s):  
Slavisa Aleksic ◽  
Gerhard Schmid ◽  
Naida Fehratovic

The ever-growing Internet data traffic leads to a continuously increasing demand for both capacity and performance in large-scale Information and Communication Technology (ICT) systems such as high-capacity routers and switches, large data centers, and supercomputers. Complex and spatially distributed multirack systems comprising a large number of data processing and storage modules with high-speed interfaces have already become reality. A consequence of this trend is that internal interconnection systems also become large and complex. Interconnection distances, the total number of cables required, and power consumption increase rapidly with capacity, which can limit the scalability of the whole system. This paper addresses the requirements and limitations of intrasystem interconnects for application in large-scale data processing and storage systems. Various point-to-point and optically switched interconnection options are reviewed with regard to their potential to achieve large scalability while reducing power consumption.

