Language Independent and Multilingual Language Identification using Infinity Ngram Approach

Author(s):  
Kidst Ergetie Andargie ◽  
Tsegay Mullu Kassa

Nowadays it is possible to access massive amounts of multilingual digital information that are generated, propagated, exchanged, stored and accessed through the web each day across the world. Such accumulation of multilingual digital data becomes an obstacle to information acquisition. To tackle this difficulty, language identification is the first of the many steps used for information acquisition. Language identification is the process of labeling a given text with its corresponding language category. Over the past decades, research has been done in the area of language identification; however, some issues remain unsolved: multilingual language identification, discriminating between documents in very closely related languages, and labeling the language category of very short texts such as words or phrases. In this investigation, we propose an approach able to address these unsolved issues of language identification (i.e. multilingual and very-short-text language identification) without a language barrier. To attain this, we adopt an approach that uses all character n-gram features of a given text unit (i.e. word, phrase, etc.). Moreover, the proposed approach is capable of identifying the language of a text at any text unit (i.e. word, phrase, sentence or document) in both monolingual and multilingual settings. This capability stems from adopting word-level features, whereby every word is classified with regard to its language category. The infinity n-gram approach uses all character n-grams of a text unit together to label the language category of a given text at the word level. To observe the effectiveness of the proposed approach, four experimental techniques were evaluated: pure infinity character n-gram, infinity n-gram with a location feature, and infinity n-gram with sentence-level and document-level reformulation.
The experimental results indicate that the infinity n-gram with a location feature, together with sentence-level and document-level reformulation, achieves promising results: an average F-measure of 100% at the word, phrase, sentence and document levels in the monolingual setting. In the multilingual setting it likewise attains an average F-measure of 100% at both the sentence and document levels, while at the phrase level it achieves 84.33%, 88.95% and 90.19% for Amharic, Ge'ez and Tigrigna, respectively. At the word level it achieves 83.16%, 80.96% and 85.85% for Amharic, Ge'ez and Tigrigna, respectively.
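The core idea of scoring a word against each language using all of its character n-grams can be sketched as follows. This is a minimal illustration under assumed details, not the authors' implementation: the scoring function, the toy training corpus and the language labels are all hypothetical stand-ins.

```python
from collections import Counter

def all_ngrams(word):
    """Every character n-gram of the word, n = 1 .. len(word) ("infinity" n-grams)."""
    return [word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)]

def train(corpus):
    """corpus: {language: [words]} -> per-language n-gram frequency models."""
    return {lang: Counter(ng for w in words for ng in all_ngrams(w))
            for lang, words in corpus.items()}

def identify(word, models):
    """Label a single word with the language whose model best covers its n-grams."""
    def score(counts):
        total = sum(counts.values()) or 1
        return sum(counts[ng] / total for ng in all_ngrams(word))
    return max(models, key=lambda lang: score(models[lang]))

# Hypothetical toy corpus, for illustration only.
models = train({"en": ["the", "there", "then"], "de": ["das", "dass", "der"]})
print(identify("them", models))  # -> en
```

Because every word is labeled independently, the same routine works on multilingual input; sentence- or document-level labels can then be derived by aggregating the per-word decisions, as the abstract's reformulation step suggests.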

2019 ◽  
Vol 15 (01) ◽  
pp. 1-8
Author(s):  
Ashish C Patel ◽  
C G Joshi

Current data storage technologies cannot keep pace with the exponentially growing amount of data produced through the extensive use of social networking, photos, media, etc. The "digital world", at 4.4 zettabytes in 2013, is predicted to reach 44 zettabytes by 2020. For the past 30 years, scientists and researchers have been trying to develop a robust way of storing data on a medium that is dense and long-lasting, and have found DNA to be the most promising storage medium. Unlike existing storage devices, DNA requires no maintenance except storage in a cool, dark place. DNA is small and extremely dense; just 1 gram of dry DNA can store about 455 exabytes of data. DNA stores information using four bases, viz. A, T, G and C, whereas CDs, hard disks and other devices store information using 0s and 1s on spiral tracks. In DNA-based storage, after binarization of the digital file into binary code, encoding and decoding are the key steps of the storage system. Once the digital file is encoded, the next step is to synthesize arbitrary single-stranded DNA sequences, which can be stored in a deep freezer until use. When the information needs to be recovered, this can be done using DNA sequencing. Next-generation sequencing (NGS) is capable of producing sequences with very high throughput at a much lower cost (less than 0.1 USD per MB of data) than the first sequencing technologies. Post-sequencing processing includes alignment of all reads using multiple sequence alignment (MSA) algorithms to obtain consensus sequences. The consensus sequence is decoded as the reversal of the encoding process. Most prior DNA data storage efforts sequenced and decoded the entire amount of stored digital information with no random access, but it has now become possible to extract selected files (e.g. retrieving only a required image from a collection) from a DNA pool using PCR-based random access.
Various scientists have successfully stored up to 110 zettabytes of data in one gram of DNA. In the future, with efficient encoding, error correction, and cheaper DNA synthesis and sequencing, DNA-based storage will become a practical solution for storing exponentially growing digital data.
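The binarize-then-encode step described above can be sketched with the simplest baseline mapping, two bits per nucleotide; this is a generic illustration of the principle, not any specific published scheme (real codecs add addressing, redundancy and homopolymer constraints).

```python
# Map each 2-bit pair to one nucleotide (2 bits per base), and back.
TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
TO_BITS = {base: bits for bits, base in TO_BASE.items()}

def encode(data: bytes) -> str:
    """Binarize the file bytes, then turn every 2-bit pair into a base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Reverse of encode: bases -> bit string -> bytes."""
    bits = "".join(TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"hi")
print(strand)                    # CGGACGGC
assert decode(strand) == b"hi"   # decoding reverses the encoding, as in the text
```

The decode step here stands in for the final stage of the pipeline the abstract describes: sequencing, MSA-based consensus building, and then reversing the encoding.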


2021 ◽  
Vol 23 (4) ◽  
pp. 796-815
Author(s):  
Yang Wang ◽  
Sun Sun Lim

People today are situated in media ecosystems in which a variety of ICT devices and platforms coexist and complement each other to fulfil users' heterogeneous requirements. These multi-media affordances promote a highly hyperlinked and nomadic habit of digital data management that blurs the long-standing boundaries between information storage, sharing and exchange. Specifically, during the pervasive sharing and browsing of fragmentary digital information (e.g. photos, videos, online diaries, news articles) across various platforms, the life experiences and knowledge involved are meanwhile classified and stored for future retrieval and collective memory construction. For international migrants who straddle different geographical and cultural contexts, the management of various digital materials is particularly complicated, as they have to be familiar with and appropriately navigate the technological infrastructures of both home and host countries. Drawing on ethnographic observations of 40 Chinese migrant mothers in Singapore, this article delves into their quotidian routines of acquiring, storing, sharing and exchanging digital information across a range of ICT devices and platforms, as well as the cultural and emotional implications of these mediated behaviours for their everyday life experiences. A multi-layer and multi-sited repertoire of 'life archiving' was identified among these migrant mothers, in which they leave footprints of everyday life through a tactical combination of interactive sharing, pervasive tagging and backup storage of diverse digital content.


2017 ◽  
Vol 45 (2) ◽  
pp. 66-74
Author(s):  
Yufeng Ma ◽  
Long Xia ◽  
Wenqi Shen ◽  
Mi Zhou ◽  
Weiguo Fan

Purpose: The purpose of this paper is the automatic classification of TV series reviews into generic categories.
Design/methodology/approach: The authors replace specific roles' and actors' names in reviews with surrogate tags to make the reviews more generic. In addition, feature selection techniques and different kinds of classifiers are incorporated.
Findings: With roles' and actors' names replaced by generic tags, the experimental results show that the model generalizes well to unseen TV series, compared with reviews that keep the original names.
Research limitations/implications: The model presented in this paper must be built on top of an existing knowledge base such as Baidu Encyclopedia, and constructing such a database takes a great deal of work.
Practical implications: In a digital information supply chain, if reviews are part of the information to be transported or exchanged, the model presented here can help automatically classify individual reviews according to different requirements and thereby support information sharing.
Originality/value: One original contribution is the surrogate-based approach to making reviews more generic. The authors also built a review data set of popular Chinese TV series that includes eight generic category labels for each review.
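The surrogate substitution step can be sketched as a simple dictionary-based replacement. This is a minimal illustration only: the name entries and tag labels below are hypothetical, whereas the authors derive theirs from a knowledge base such as Baidu Encyclopedia.

```python
import re

# Hypothetical name-to-tag entries; the paper builds these from a knowledge base.
NAME_TO_TAG = {
    "Zhang Wei": "<ACTOR>",
    "Wei Xiaobao": "<ROLE>",
}

def make_generic(review: str) -> str:
    """Replace specific actor/role names with generic surrogate tags."""
    for name, tag in NAME_TO_TAG.items():
        review = re.sub(re.escape(name), tag, review)
    return review

print(make_generic("Zhang Wei nails the part of Wei Xiaobao."))
# -> <ACTOR> nails the part of <ROLE>.
```

After this substitution, standard feature selection and classification proceed on the generic text, which is what lets a trained model transfer to series whose cast names never appeared in training.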


Author(s):  
Somnath Banerjee ◽  
Alapan Kuila ◽  
Aniruddha Roy ◽  
Sudip Kumar Naskar ◽  
Paolo Rosso ◽  
...  

2020 ◽  
pp. 19-43
Author(s):  
Henri Schildt

This chapter examines digitalization as a set of new normative ideals for managing and organizing businesses, enabled by new technologies. The data imperative consists of two mutually reinforcing goals: the pursuit of omniscience—the aspiration of management to capture the world relevant to the company through digital data; and the pursuit of omnipotence—an aspiration of managers to control and optimize activities in real-time and around the world through software. The data imperative model captures a self-reinforcing cycle of four sequential steps: (1) the creation and capture of data, (2) the combination and analysis of data, (3) the redesign of business processes around smart algorithms, and (4) the ability to control the world through digital information flows. The logical end-point of the data imperative is a ‘programmable world’, a conception of society saturated with Internet-connected hardware that is able to capture processes in real time and control them in order to optimize desired outcomes.


Information ◽  
2019 ◽  
Vol 10 (11) ◽  
pp. 332 ◽  
Author(s):  
Kenneth Thibodeau

This paper presents Constructed Past Theory, an epistemological theory about how we come to know things that happened or existed in the past. The theory is expounded both in text and in a formal model comprising UML class diagrams. The ideas presented here have been developed in a half century of experience as a practitioner in the management of information and automated systems in the US government and as a researcher in several collaborations, notably the four international and multidisciplinary InterPARES projects. This work is part of a broader initiative, providing a conceptual framework for reformulating the concepts and theories of archival science in order to enable a new discipline whose assertions are empirically and, wherever possible, quantitatively testable. The new discipline, called archival engineering, is intended to provide an appropriate, coherent foundation for the development of systems and applications for managing, preserving and providing access to digital information, development which is necessitated by the exponential growth and explosive diversification of data recorded in digital form and the use of digital data in an ever increasing variety of domains. Both the text and model are an initial exposition of the theory that both requires and invites further development.


2012 ◽  
Vol 201-202 ◽  
pp. 991-995
Author(s):  
Xing She ◽  
Hai Bo Wang ◽  
Wang Qun Xiao ◽  
Qun Yan

This thesis demonstrates the feasibility of building a 3D model database and information query platform for the decorative art of Huizhou traditional dwellings, using digital information acquisition and processing technology, and summarizes the terminal design method as well. It provides technical means and channels for research on Hui-style dwelling construction, the inheritance of Hui culture, and related art design practice. The terminal offers a new model for promoting, publicizing, applying and developing the decorative art of Hui-style dwelling construction to the world. It will enrich the digital content of Hui culture, which has important practical and strategic significance for the construction of a "digital Anhui".


10.2196/17638 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e17638
Author(s):  
Jian Wang ◽  
Xiaoyu Chen ◽  
Yu Zhang ◽  
Yijia Zhang ◽  
Jiabin Wen ◽  
...  

Background: Automatically extracting relations between chemicals and diseases plays an important role in biomedical text mining. Chemical-disease relation (CDR) extraction aims at extracting complex semantic relationships between entities in documents, which include intrasentence and intersentence relations. Most previous methods did not consider dependency syntactic information across sentences, which is very valuable for the relation extraction task, in particular for extracting intersentence relations accurately.
Objective: In this paper, we propose a novel end-to-end neural network based on the graph convolutional network (GCN) and multihead attention, which makes use of dependency syntactic information across sentences to improve the CDR extraction task.
Methods: To improve the performance of intersentence relation extraction, we constructed a document-level dependency graph to capture dependency syntactic information across sentences. A GCN is applied to capture the feature representation of the document-level dependency graph. The multihead attention mechanism is employed to learn the relatively important context features from different semantic subspaces. To enhance the input representation, a deep context representation is used in our model instead of traditional word embeddings.
Results: We evaluated our method on the CDR corpus. The experimental results show that our method achieves an F-measure of 63.5%, which is superior to other state-of-the-art methods. At the intrasentence level, our method achieves a precision, recall, and F-measure of 59.1%, 81.5%, and 68.5%, respectively. At the intersentence level, our method achieves a precision, recall, and F-measure of 47.8%, 52.2%, and 49.9%, respectively.
Conclusions: The GCN model can effectively exploit cross-sentence dependency information to improve the performance of intersentence CDR extraction. Both the deep context representation and multihead attention are helpful in the CDR extraction task.
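The GCN-over-a-document-level-dependency-graph step can be sketched as a single standard GCN propagation over a toy adjacency matrix. This is a generic illustration of the technique named in the abstract, not the authors' model: the graph, feature sizes and cross-sentence edge below are all invented for the example.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One standard GCN step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(d ** -0.5)         # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy document-level dependency graph over 4 tokens: dependency edges within
# sentences, plus one assumed cross-sentence link between tokens 1 and 2.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 8))   # token input features
W = np.random.default_rng(1).normal(size=(8, 8))   # layer weights
out = gcn_layer(A, H, W)
print(out.shape)  # (4, 8)
```

The cross-sentence edge is what lets information flow between tokens of different sentences in a single propagation step, which is the intuition behind using a document-level rather than sentence-level graph for intersentence relations.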


2018 ◽  
Author(s):  
Henry H. Lee ◽  
Reza Kalhor ◽  
Naveen Goela ◽  
Jean Bolot ◽  
George M. Church

DNA is an emerging storage medium for digital data but its adoption is hampered by limitations of phosphoramidite chemistry, which was developed for the single-base accuracy required for biological functionality. Here, we establish a de novo enzymatic DNA synthesis strategy designed from the bottom-up for information storage. We harness a template-independent DNA polymerase for controlled synthesis of sequences with user-defined information content. We demonstrate retrieval of 144 bits, including addressing, from perfectly synthesized DNA strands using batch-processed Illumina and real-time Oxford Nanopore sequencing. We then develop a codec for data retrieval from populations of diverse but imperfectly synthesized DNA strands, each with a ~30% error tolerance. With this codec, we experimentally validate a kilobyte-scale design which stores 1 bit per nucleotide. Simulations of the codec support reliable and robust storage of information for large-scale systems. This work paves the way for alternative synthesis and sequencing strategies to advance information storage in DNA.


In today's world of advancing internet technologies, digital data are widely used to share information over public networks. Many traditional security techniques are used to protect digital information, but the existing methods do not provide much security for digital media such as images, video and audio. Digital watermarking is employed in the protection of digital information. This paper reviews digital image watermarking based on visual cryptography to achieve secure protection of images. The secret information can be inserted into the original images, and a secret key is generated from the watermark image with the help of visual cryptography to claim ownership of the images. Various types of visual cryptography and digital image watermarking techniques are explained, along with real-time applications.
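The secret-sharing idea underlying the schemes surveyed above can be sketched with a minimal XOR-based (2, 2) scheme, a simplified relative of classical visual cryptography (which uses subpixel expansion and OR-stacking). The bit-level watermark below is a hypothetical stand-in for a binary watermark image.

```python
import secrets

def make_shares(bits):
    """Split a binary watermark into two shares; either share alone is random noise."""
    share1 = [secrets.randbelow(2) for _ in bits]      # uniformly random share
    share2 = [s ^ b for s, b in zip(share1, bits)]     # second share masks the secret
    return share1, share2

def stack(share1, share2):
    """Recombine the shares (XOR) to recover the watermark bits exactly."""
    return [a ^ b for a, b in zip(share1, share2)]

watermark = [1, 0, 1, 1, 0, 0, 1, 0]   # toy binary watermark
s1, s2 = make_shares(watermark)
assert stack(s1, s2) == watermark       # only both shares together reveal it
```

In the watermarking setting described above, one share can play the role of the secret key kept by the owner, so ownership is claimed by stacking it with the share derived from the image.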

