string mining
Recently Published Documents



2018, Vol 45 (2), pp. 196-211
Author(s): Yu Qian, Yang Du, Xiongwen Deng, Baojun Ma, Qiongwei Ye, ...

Textual information retrieval (TIR) is based on the relationship between word units. Traditional word segmentation techniques attempt to discern the word units accurately from texts; however, they are unable to appropriately and efficiently identify all new words. Identification of new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlated information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining for discovering new words from domain corpora. First, we map all word units in a domain corpus to a high-dimensional word vector space. Second, we use a frequent n-gram word string mining method to identify a set of candidates for new words. We design a pruning strategy based on the word vectors to quantify the possibility of a word string being a new word, thereby allowing the evaluation of candidates based on the similarity of word units in the same string. In a comparative study, our experimental results show that WEBM has a clear advantage in detecting new words from massive Chinese corpora.
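The two steps described in this abstract can be illustrated with a minimal sketch. It is not the WEBM implementation: the function names, the mean-adjacent-cosine cohesion score, the thresholds, and the toy vectors are all assumptions made for illustration. The abstract only states that candidates are evaluated by the similarity of word units within the same string.

```python
from collections import Counter
from math import sqrt


def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Count contiguous word-unit strings of length 2..max_n and keep the frequent ones."""
    counts = Counter()
    for units in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cohesion(gram, vectors):
    """Hypothetical score: mean cosine similarity of adjacent word units in the string."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
    return sum(sims) / len(sims)


def prune(candidates, vectors, threshold=0.5):
    """Keep candidates whose units all have vectors and whose cohesion meets the threshold."""
    return {
        gram: count
        for gram, count in candidates.items()
        if all(unit in vectors for unit in gram) and cohesion(gram, vectors) >= threshold
    }
```

In practice the vectors would come from an embedding model trained on the domain corpus, and the threshold would need tuning against labeled examples.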


2016
Author(s): John A. Lees, Minna Vehkala, Niko Välimäki, Simon R. Harris, Claire Chewapreecha, ...

Bacterial genomes vary extensively in terms of both gene content and gene sequence – this plasticity hampers the use of traditional SNP-based methods for identifying all genetic associations with phenotypic variation. Here we introduce a computationally scalable and widely applicable statistical method (SEER) for the identification of sequence elements that are significantly enriched in a phenotype of interest. SEER is applicable to even tens of thousands of genomes by counting variable-length k-mers using a distributed string-mining algorithm. Robust options are provided for association analysis that also correct for the clonal population structure of bacteria. Using large collections of genomes of the major human pathogens Streptococcus pneumoniae and Streptococcus pyogenes, SEER identifies relevant previously characterised resistance determinants for several antibiotics and discovers potential novel factors related to the invasiveness of S. pyogenes. We thus demonstrate that our method can answer important biologically and medically relevant questions.
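The idea of testing k-mers for enrichment in a phenotype can be sketched as follows. This is a deliberately simplified illustration and not SEER itself: it uses fixed-length k-mers instead of variable-length ones, a one-sided Fisher exact test instead of the association models the abstract mentions, and no correction for population structure. All function names are hypothetical.

```python
from math import comb


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_table(genomes, k):
    """Map each k-mer to the set of sample indices whose sequence contains it."""
    table = {}
    for idx, seq in enumerate(genomes):
        for km in kmers(seq, k):
            table.setdefault(km, set()).add(idx)
    return table


def fisher_enrichment(present, phenotype):
    """One-sided Fisher exact p-value that carriers are enriched among phenotype-positive samples.

    present   -- set of sample indices carrying the k-mer
    phenotype -- list of 0/1 labels, one per sample
    """
    n = len(phenotype)
    cases = sum(phenotype)
    carriers = len(present)
    observed = sum(1 for i in present if phenotype[i])  # carriers that are cases
    upper = min(cases, carriers)
    # Hypergeometric tail: probability of seeing at least `observed` case carriers.
    tail = sum(
        comb(cases, x) * comb(n - cases, carriers - x)
        for x in range(observed, upper + 1)
    )
    return tail / comb(n, carriers)
```

With thousands of genomes, each tested k-mer adds to the multiple-testing burden, so any real analysis would also need a corrected significance threshold.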


2014, Vol 7 (23), pp. 5063-5067
Author(s): K. Geetha Rani, Shobhanjaly P. Nair, P. Visu, S. Koteeswaran

2012, Vol 24 (4), pp. 735-744
Author(s): Jasbir Dhaliwal, Simon J. Puglisi, Andrew Turpin

2010, Vol 20-23, pp. 653-658
Author(s): Zhan Xi Guo, Zhi Xin Ma, Yu Sheng Xu, Li Liu

Given m databases D1,...,Dm of strings, the purpose of frequent string mining is to find all strings that fulfill certain constraints on all string databases. In this paper, a data structure is proposed for constructing the suffix and LCP tables that efficiently reduces the total space consumption of string mining. We demonstrate the use of this data structure by optimizing the algorithm proposed by A. Kügel et al. [7] and present the improved algorithm. The space consumption of our algorithm is proportional to the length of the largest string across all databases. A set of comprehensive performance experiments shows that the processing rate improves because the number of items is reduced in the new data structure.
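For readers unfamiliar with the suffix and LCP tables mentioned here, the following minimal sketch builds both for a single string and uses them to find the longest repeated substring. It does not reproduce the space-efficient data structure proposed in the paper or the algorithm of Kügel et al.: the suffix array is built by naive sorting, the LCP table by the standard Kasai procedure, and all function names are illustrative assumptions.

```python
def suffix_array(s):
    """Start positions of all suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_table(s, sa):
    """lcp[r] = length of the longest common prefix of the suffixes at sa[r-1] and sa[r].

    lcp[0] is 0 by convention. Uses the Kasai procedure, linear time given sa.
    """
    n = len(s)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def longest_repeated_substring(s):
    """Longest substring occurring at least twice, read off the maximum LCP entry."""
    sa = suffix_array(s)
    lcp = lcp_table(s, sa)
    if not lcp:
        return ""
    r = max(range(len(lcp)), key=lcp.__getitem__)
    return s[sa[r]:sa[r] + lcp[r]]
```

Frequent substrings correspond to runs of adjacent suffix-array entries whose LCP values stay above a given length, which is why these two tables are a common basis for frequency-constrained string mining.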


Author(s): Mohamed Abouelhoda, Moustafa Ghanem
