An Efficient Minimal Text Segmentation Method for URL Domain Names

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Yiqian Li ◽  
Tao Du ◽  
Lianjiang Zhu ◽  
Shouning Qu

Text segmentation of URL domain names is a straightforward and convenient way to analyze users’ online behavior and is crucial for determining their areas of interest. However, popular word segmentation tools perform relatively poorly on website domain names because of their unique structure (extremely short length, irregular naming, and no contextual relationship). To address this issue, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining. We first design a targeted hierarchical task model to reduce noise interference in minimal texts. We then present a novel method that integrates a conflict game into the two-directional maximum matching algorithm, so that words with higher weight and greater probability are selected, thereby improving recognition accuracy. Next, Chinese Pinyin and English mappings are embedded in the word segmentation rules. In addition, we incorporate a correction factor that accounts for text length into the F1-score to improve the performance evaluation of text segmentation. The experimental results show that EMTS improves on other word segmentation tools by around 20 percentage points in accuracy and topic extraction, providing high-quality data for subsequent text analysis.
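To make the two-directional maximum matching step concrete, here is a minimal Python sketch. It is not the EMTS implementation: the abstract does not specify the conflict game, so disagreement between the forward and backward passes is resolved here by a simple assumed rule (higher total word weight wins, fewer tokens break ties). The names `DEMO_WEIGHTS`, `forward_max_match`, `backward_max_match`, and `segment`, as well as the weight values, are hypothetical.

```python
# Hypothetical word weights; a real system would derive these from a corpus.
DEMO_WEIGHTS = {"choose": 0.6, "chooses": 0.2, "pain": 0.3, "spain": 0.5}


def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right: take the longest vocabulary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left: take the longest vocabulary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def segment(text, weights):
    """Run both directions; on disagreement keep the higher-weight candidate.

    Assumption: this tie-break stands in for the paper's conflict game,
    whose details are not given in the abstract.
    """
    max_len = max(map(len, weights))
    fwd = forward_max_match(text, weights, max_len)
    bwd = backward_max_match(text, weights, max_len)
    if fwd == bwd:
        return fwd

    def score(tokens):
        # Unknown single characters contribute nothing; fewer tokens break ties.
        return (sum(weights.get(t, 0.0) for t in tokens), -len(tokens))

    return max(fwd, bwd, key=score)
```

With these weights, the forward pass splits `choosespain` as `chooses` + `pain`, the backward pass as `choose` + `spain`, and the weight comparison keeps the backward result.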

Author(s):  
Zeyuan Wang ◽  
Josiah Poon ◽  
Shuze Wang ◽  
Shiding Sun ◽  
Simon Poon

Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 275
Author(s):  
Igor A. Bessmertny ◽  
Xiaoxi Huang ◽  
Aleksei V. Platonov ◽  
Chuqiao Yu ◽  
Julia A. Koroleva

Search engines are able to find documents containing patterns from a query. This approach can be used for alphabetic languages such as English. However, Chinese is highly dependent on context. A significant problem in Chinese text processing is the absence of blanks between words, so the text must be segmented into words before any other processing. Algorithms for Chinese text segmentation should consider context; that is, the word segmentation process depends on other ideograms. As the existing segmentation algorithms are imperfect, we consider an approach that builds the context from all possible n-grams surrounding the query words. This paper proposes a quantum-inspired approach to ranking Chinese text documents by their relevance to the query. In particular, this approach uses Bell’s test, which measures the quantum entanglement of two words within the context. The contexts of words are built using the hyperspace analogue to language (HAL) algorithm. Experiments conducted in three domains demonstrated that the proposed approach provides acceptable results.
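As a rough illustration of how a HAL context can be built, the sketch below accumulates distance-weighted co-occurrence counts in a sliding window. It assumes the common HAL convention in which a neighbour at distance d inside a window of size K contributes K - d + 1; the abstract does not state the window size or weighting the authors used, and the function name `hal_matrix` is hypothetical. The Bell's test ranking step is not sketched.

```python
from collections import defaultdict


def hal_matrix(tokens, window=5):
    """Accumulate distance-weighted co-occurrence of each token with its predecessors."""
    hal = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            # Closer neighbours get larger weights: window, window - 1, ..., 1.
            hal[word][tokens[j]] += window - distance + 1
    return hal
```

For unsegmented Chinese text, `tokens` could simply be the sequence of characters or overlapping n-grams, which avoids committing to one segmentation up front.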


Author(s):  
Torsten Bettinger

Although no single organizational, financial, or operational management is responsible for the entire Internet, certain administrative tasks are coordinated centrally. Among the most important organizational tasks that require global regulation is the management of Internet Protocol (IP) addresses and their corresponding domain names. The IP address consists of a 32-bit (IPv4) or 128-bit (IPv6) sequence of digits and is the actual physical network address by which routing on the Internet takes place and which ensures that data packets reach the correct host computer.


Author(s):  
Adonna Alkema

In the Netherlands, there is no legislation dealing with the registration and use of domain names. Domain name conflicts are therefore decided on the basis of existing laws, such as laws regarding the protection of trademarks and trade names and tort law. Domain name conflicts often lead to court proceedings, resulting in over 500 decisions rendered by first instance courts so far and more than 90 decisions rendered by appeal courts.


Author(s):  
Philipp Fabbio

Statutory provisions dealing specifically with domain names are found in the Codice della Proprietà Industriale (‘the CPI’), ss 12(1)(c), 22, 118(6), and 133. Sections 12(1)(c) and 22 define the scope of trademark protection. In doing so, they also consider interference with domain names that are used in the course of a business activity (nomi a dominio aziendali). Sections 118(6) and 133 deal with remedies for trademark infringements and make explicit reference to domain names as well. Besides these specific rules, conflicts before the Italian courts based on domain name registrations are to be resolved according to the general rules of trademark, competition, and civil law.


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Hong Zhao ◽  
Zhaobin Chang ◽  
Guangbin Bao ◽  
Xiangyan Zeng

Malicious domain name attacks have become a serious issue for Internet security. In this study, a malicious domain names detection algorithm based on N-Gram is proposed. The top 100,000 domain names in Alexa 2013 are used in the N-Gram method. Each domain name excluding the top-level domain is segmented into substrings according to its domain level with the lengths of 3, 4, 5, 6, and 7. The substring set of the 100,000 domain names is established, and the weight value of a substring is calculated according to its occurrence number in the substring set. To detect a malicious attack, the domain name is also segmented by the N-Gram method and its reputation value is calculated based on the weight values of its substrings. Finally, the judgment of whether the domain name is malicious is made by thresholding. In the experiments on Alexa 2017 and Malware domain list, the proposed detection algorithm yielded an accuracy rate of 94.04%, a false negative rate of 7.42%, and a false positive rate of 6.14%. The time complexity is lower than other popular malicious domain names detection algorithms.
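The pipeline described above (substring extraction, weighting by occurrence, reputation scoring, thresholding) can be sketched in Python as follows. The abstract does not give the exact weight or reputation formulas, so this sketch assumes a log-scaled count as the weight and the mean substring weight as the reputation value. The function names and the naive "drop the last label" handling of the top-level domain are also assumptions.

```python
import math
from collections import Counter

NGRAM_LENGTHS = (3, 4, 5, 6, 7)  # substring lengths named in the abstract


def substrings(domain):
    """Yield n-grams from each label of the domain, skipping the top-level domain."""
    labels = domain.lower().split(".")[:-1]  # naive TLD removal: drop the last label
    for label in labels:
        for n in NGRAM_LENGTHS:
            for i in range(len(label) - n + 1):
                yield label[i:i + n]


def build_weights(benign_domains):
    """Count n-grams over a benign list and turn counts into log-scaled weights.

    Assumption: log2(1 + count) is a stand-in for the paper's weight formula.
    """
    counts = Counter(g for d in benign_domains for g in substrings(d))
    return {g: math.log2(1 + c) for g, c in counts.items()}


def reputation(domain, weights):
    """Average weight of the domain's n-grams; unseen n-grams score zero."""
    grams = list(substrings(domain))
    if not grams:
        return 0.0
    return sum(weights.get(g, 0.0) for g in grams) / len(grams)


def is_malicious(domain, weights, threshold):
    """Flag a domain whose reputation falls below the chosen threshold."""
    return reputation(domain, weights) < threshold
```

In practice the threshold would be tuned on labeled data; a domain made of substrings never seen in the benign list gets a reputation of zero under these assumptions.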


2006 ◽  
Vol 16 (3) ◽  
pp. 343-367 ◽  
Author(s):  
Richard A. Spinello

The Internet presents opportunities for corporations to efficiently build their brands online and to enhance their global reach. But there are threats as well as opportunities, since anti-branding and free-riding activities are easier in cyberspace. One such threat is the unauthorized incorporation of a trademark into a domain name. This can lead to trademark dilution and cause consumer confusion. But some users claim a right to use these trademarks for the purpose of parody or criticism. Underlying these trademark conflicts is the familiar tension between property rights and free speech rights. While some trademark scholars are reluctant to consider a trademark as property, we find strong support for the property paradigm in Hegel’s philosophy. Assuming that a trademark is an earned property right, we propose that a trademark owner should be allowed to control the permutations of its trademark incorporated into domain names unless a reasonable person would not confuse that domain name with the company’s mark. But we also conclude that there must be latitude to employ a domain name for negative editorial comment, so long as the source and purpose of that domain name is plainly apparent.


2020 ◽  
Vol 21 (2) ◽  
pp. 153-163
Author(s):  
Nor Farahidah Za'bah ◽  
Ahmad Amierul Ashraf Muhammad Nazmi ◽  
Amelia Wong Azman

Segmentation is an important aspect of translating the finger spelling of sign language into the Latin alphabet. Although currently available sign language devices can translate finger spelling into letters, their output is stored as one long continuous string without spaces between words. The system proposed in this work is meant to be used together with a text-generating glove device. The input text string is fed into the system one character at a time and is then segmented into semantically correct words. The proposed text segmentation method uses dynamic programming and a back-off algorithm, together with a probability score obtained by word matching against an English-language text corpus. Based on the results, the system is able to segment words properly with acceptable accuracy.
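A minimal sketch of dynamic-programming segmentation over an unspaced string is shown below. It assumes a unigram model built from word counts and a crude length-based penalty for unknown strings; that penalty is a stand-in for the back-off step, whose details the abstract does not give. The function name `segment_words`, the `max_word_len` parameter, and the penalty formula are all hypothetical.

```python
import math


def segment_words(text, word_counts, max_word_len=20):
    """Split an unspaced string into the most probable word sequence (unigram model)."""
    total = sum(word_counts.values())

    def log_prob(word):
        if word in word_counts:
            return math.log(word_counts[word] / total)
        # Assumed fallback for unknown strings: penalize heavily, more so for longer ones.
        return math.log(1.0 / (total * 10 ** len(word)))

    # best[i] holds (score, start) for the best segmentation of text[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)

    # Walk the stored start indices backwards to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

Because the table is filled left to right, the same recurrence can be updated incrementally as each new character arrives from the glove device, re-deriving the best split of the prefix seen so far.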

