An Efficient Minimal Text Segmentation Method for URL Domain Names

Scientific Programming ◽

10.1155/2021/9946729 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Yiqian Li ◽

Tao Du ◽

Lianjiang Zhu ◽

Shouning Qu

Keyword(s):

Word Segmentation ◽

Quality Data ◽

Maximum Matching ◽

Text Segmentation ◽

Domain Name ◽

Noise Interference ◽

Domain Names ◽

Text Length ◽

Novel Method ◽

Areas Of Interest

Text segmentation of URL domain names is a straightforward and convenient way to analyze users’ online behavior and is crucial for determining their areas of interest. However, popular word segmentation tools perform relatively poorly on website domain names because of their unique structure (extremely short lengths, irregular names, and no contextual relationship). To address this issue, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining. We first designed a targeted hierarchical task model to reduce noise interference in minimal texts. We then presented a novel method that integrates a conflict game into the two-directional maximum matching algorithm, so that words with higher weight and greater probability are selected, thereby enhancing recognition accuracy. Next, Chinese Pinyin and English mapping were embedded in the word segmentation rules. In addition, we incorporated a correction factor that accounts for text length into the F1-score to optimize the performance evaluation of text segmentation. The experimental results show that EMTS improved on other word segmentation tools by around 20 percentage points in accuracy and topic extraction, providing high-quality data for subsequent text analysis.
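
As a rough illustration of the two-directional maximum matching step described above, the following sketch segments a domain label in both directions and, when the two results conflict, keeps the one with the higher total word weight. The weight-sum tie-break is only a stand-in for the paper's conflict game, whose details the abstract does not give; the function names and the `WEIGHTS` table are hypothetical.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, taking the longest vocabulary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Scan right to left, taking the longest vocabulary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens


def total_weight(tokens, weights):
    """Sum the weights of the tokens; unknown tokens contribute nothing."""
    return sum(weights.get(token, 0.0) for token in tokens)


def bidirectional_max_match(text, weights):
    """Run both scans and resolve a conflict by total weight (assumed rule)."""
    vocab = set(weights)
    max_len = max(len(word) for word in vocab)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if fwd == bwd:
        return fwd
    # Prefer the higher total weight, then the segmentation with fewer tokens.
    return max((fwd, bwd), key=lambda toks: (total_weight(toks, weights), -len(toks)))


# Hypothetical word weights, made up for this example.
WEIGHTS = {"choose": 3.0, "chooses": 1.0, "spain": 3.0, "pain": 2.0}

print(bidirectional_max_match("choosespain", WEIGHTS))
```

Here the forward scan yields "chooses" + "pain" and the backward scan yields "choose" + "spain"; the weights decide between them.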

A Novel Method for Clinical Risk Prediction with Low-Quality Data

Artificial Intelligence in Medicine ◽

10.1016/j.artmed.2021.102052 ◽

2021 ◽

pp. 102052

Author(s):

Zeyuan Wang ◽

Josiah Poon ◽

Shuze Wang ◽

Shiding Sun ◽

Simon Poon

Keyword(s):

Risk Prediction ◽

Quality Data ◽

Clinical Risk ◽

Novel Method

Applying the Bell’s Test to Chinese Texts

Entropy ◽

10.3390/e22030275 ◽

2020 ◽

Vol 22 (3) ◽

pp. 275

Author(s):

Igor A. Bessmertny ◽

Xiaoxi Huang ◽

Aleksei V. Platonov ◽

Chuqiao Yu ◽

Julia A. Koroleva

Keyword(s):

Quantum Entanglement ◽

Chinese Text ◽

Search Engines ◽

Text Processing ◽

Word Segmentation ◽

Significant Problem ◽

Text Segmentation ◽

Text Documents ◽

Segmentation Algorithms ◽

Chinese Texts

Search engines are able to find documents containing patterns from a query. This approach can be used for alphabetic languages such as English. However, Chinese is highly dependent on context. A significant problem in Chinese text processing is the absence of blanks between words, so the text must be segmented into words before any other action. Algorithms for Chinese text segmentation should consider context; that is, the word segmentation process depends on other ideograms. As the existing segmentation algorithms are imperfect, we have considered an approach that builds the context from all possible n-grams surrounding the query words. This paper proposes a quantum-inspired approach to rank Chinese text documents by their relevance to the query. In particular, this approach uses Bell’s test, which measures the quantum entanglement of two words within the context. The contexts of words are built using the hyperspace analogue to language (HAL) algorithm. Experiments conducted in three domains demonstrated that the proposed approach provides acceptable results.
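
The HAL step mentioned above can be sketched as a sliding-window co-occurrence count in which closer neighbours receive larger weights. This is a minimal sketch under assumptions: the window size, the linear weighting `window - distance + 1`, and the name `hal_matrix` are illustrative choices, and the Bell's test ranking itself is not reproduced here.

```python
from collections import defaultdict


def hal_matrix(tokens, window=5):
    """Build a HAL-style matrix: matrix[w][c] grows when c precedes w within the window.

    The weight is window - distance + 1, so an adjacent token gets the
    largest weight and a token at the edge of the window gets weight 1.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            matrix[word][tokens[j]] += window - distance + 1
    return matrix


# Single ideograms are used as tokens, so no prior word segmentation is needed.
tokens = list("我爱北京我爱中国")
context = hal_matrix(tokens, window=2)
print(dict(context["爱"]))
```

Each row of the matrix serves as the context vector of a token; two tokens can then be compared through their rows.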

Structure and Organization of the Domain Name System

Domain Name Law And Practice ◽

10.1093/oso/9780199663163.003.0002 ◽

2015 ◽

Author(s):

Torsten Bettinger

Keyword(s):

Host Computer ◽

The Internet ◽

Domain Name System ◽

Domain Name ◽

Global Regulation ◽

Ip Address ◽

Domain Names ◽

Data Packets ◽

System P ◽

Administrative Tasks

Although the Internet has no cross-organizational, financial, or operational management responsible for the entire Internet, certain administrative tasks are coordinated centrally. Among the most important organizational tasks that require global regulation is the management of Internet Protocol (IP) addresses and their corresponding domain names. The IP address consists of a 32-bit (IPv4) or 128-bit (IPv6) sequence of digits and is the actual physical network address by which routing on the Internet takes place, ensuring that data packets reach the correct host computer.
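
The 32-bit and 128-bit address sizes mentioned above can be checked with Python's standard `ipaddress` module. The two addresses are taken from the ranges reserved for documentation and are chosen here only for illustration.

```python
import ipaddress

# Example addresses from the documentation ranges (illustrative only).
v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")

# max_prefixlen is the total number of bits in the address.
print(v4.version, v4.max_prefixlen)
print(v6.version, v6.max_prefixlen)

# An address is ultimately an integer; the dotted form is a notation for it.
print(int(v4))
```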

Netherlands (‘.nl’)

Domain Name Law And Practice ◽

10.1093/oso/9780199663163.003.0026 ◽

2015 ◽

Author(s):

Adonna Alkema

Keyword(s):

The Netherlands ◽

Tort Law ◽

Domain Name ◽

Domain Names ◽

Court Proceedings ◽

Trade Names ◽

Appeal Courts

In the Netherlands, there is no legislation dealing with the registration and use of domain names. Domain name conflicts are therefore decided on the basis of existing laws, such as laws regarding the protection of trademarks and trade names and tort law. Domain name conflicts often lead to court proceedings, resulting in over 500 decisions rendered by first instance courts so far and more than 90 decisions rendered by appeal courts.

Italy (‘.it’)

Domain Name Law And Practice ◽

10.1093/oso/9780199663163.003.0024 ◽

2015 ◽

Author(s):

Philipp Fabbio

Keyword(s):

Civil Law ◽

Business Activity ◽

Domain Name ◽

Domain Names ◽

General Rules ◽

Explicit Reference ◽

Trademark Protection

Statutory provisions dealing specifically with domain names are found in the Codice della Proprietà Industriale (‘the CPI’), ss 12(1)(c), 22, 118(6), and 133. Sections 12(1)(c) and 22 define the scope of trademark protection. In doing so, they also consider interference with domain names that are used in the course of a business activity (nomi a dominio aziendali). Sections 118(6) and 133 deal with remedies for trademark infringements and make explicit reference to domain names as well. Besides these specific rules, conflicts before the Italian courts based on domain name registrations are to be resolved according to the general rules of trademark, competition, and civil law.

Design of Chinese Word Segmentation System Based on Improved Chinese Converse Dictionary and Reverse Maximum Matching Algorithm

Web Information Systems – WISE 2006 Workshops - Lecture Notes in Computer Science ◽

10.1007/11906070_17 ◽

2006 ◽

pp. 171-181 ◽

Cited By ~ 2

Author(s):

Liyi Zhang ◽

Yazi Li ◽

Jian Meng

Keyword(s):

Word Segmentation ◽

Maximum Matching ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Matching Algorithm

Malicious Domain Names Detection Algorithm Based on N-Gram

Journal of Computer Networks and Communications ◽

10.1155/2019/4612474 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 2

Author(s):

Hong Zhao ◽

Zhaobin Chang ◽

Guangbin Bao ◽

Xiangyan Zeng

Keyword(s):

False Positive Rate ◽

False Negative ◽

False Negative Rate ◽

Detection Algorithm ◽

Internet Security ◽

Domain Name ◽

Domain Names ◽

Malicious Attack ◽

Positive Rate ◽

N Gram

Malicious domain name attacks have become a serious issue for Internet security. In this study, a malicious domain name detection algorithm based on N-Gram is proposed. The top 100,000 domain names in Alexa 2013 are used in the N-Gram method. Each domain name, excluding the top-level domain, is segmented by domain level into substrings of lengths 3, 4, 5, 6, and 7. The substring set of the 100,000 domain names is established, and the weight value of a substring is calculated from its number of occurrences in the substring set. To detect a malicious attack, the domain name under test is also segmented by the N-Gram method, and its reputation value is calculated from the weight values of its substrings. Finally, whether the domain name is malicious is decided by thresholding. In experiments on Alexa 2017 and the Malware domain list, the proposed detection algorithm yielded an accuracy rate of 94.04%, a false negative rate of 7.42%, and a false positive rate of 6.14%. Its time complexity is lower than that of other popular malicious domain name detection algorithms.
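
The pipeline described above (substring set, substring weights, reputation value, threshold) might look roughly like the sketch below. The abstract does not state the weight or reputation formulas, so the log-scaled count and the mean over substrings are assumptions, as are all function names; dropping only the last dot-separated part is also a simplification of top-level-domain handling.

```python
import math
from collections import Counter


def ngrams(label, sizes=(3, 4, 5, 6, 7)):
    """Return all substrings of the given lengths from one domain level."""
    return [label[i:i + n] for n in sizes for i in range(len(label) - n + 1)]


def domain_levels(domain):
    """Split a domain into levels and drop the last part (naive TLD removal)."""
    return domain.lower().split(".")[:-1]


def build_weights(benign_domains):
    """Weight each substring by a log-scaled occurrence count (assumed formula)."""
    counts = Counter()
    for domain in benign_domains:
        for label in domain_levels(domain):
            counts.update(ngrams(label))
    return {gram: math.log2(1 + count) for gram, count in counts.items()}


def reputation(domain, weights):
    """Average the weights of the domain's substrings (assumed formula)."""
    grams = [g for label in domain_levels(domain) for g in ngrams(label)]
    if not grams:
        return 0.0
    return sum(weights.get(g, 0.0) for g in grams) / len(grams)


def is_malicious(domain, weights, threshold):
    """Flag a domain whose reputation falls below the threshold."""
    return reputation(domain, weights) < threshold


# A toy benign list standing in for the 100,000 Alexa domains.
weights = build_weights(["google.com", "googlemail.com", "mail.example.org"])
print(reputation("google.net", weights), reputation("xkqzjv.net", weights))
```

In practice the threshold would be tuned on labelled data; the value is left as a parameter here.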

A Novel Method to Restrain Ambient Noise Interference in Transformer Noise Test

IOP Conference Series Earth and Environmental Science ◽

10.1088/1755-1315/692/3/032081 ◽

2021 ◽

Vol 692 (3) ◽

pp. 032081

Author(s):

Xiaowen Wu ◽

Gang Li ◽

Hao Cao ◽

Ling Lu

Keyword(s):

Ambient Noise ◽

Noise Interference ◽

Novel Method ◽

Noise Test

Online Brands and Trademark Conflicts: A Hegelian Perspective

Business Ethics Quarterly ◽

10.5840/beq200616326 ◽

2006 ◽

Vol 16 (3) ◽

pp. 343-367 ◽

Cited By ~ 9

Author(s):

Richard A. Spinello

Keyword(s):

Free Speech ◽

Strong Support ◽

The Internet ◽

Domain Name ◽

Reasonable Person ◽

Domain Names ◽

Consumer Confusion ◽

Trademark Dilution ◽

Online Brands ◽

Free Speech Rights

The Internet presents opportunities for corporations to efficiently build their brands online and to enhance their global reach. But there are threats as well as opportunities, since anti-branding and free-riding activities are easier in cyberspace. One such threat is the unauthorized incorporation of a trademark into a domain name. This can lead to trademark dilution and cause consumer confusion. But some users claim a right to use these trademarks for the purpose of parody or criticism. Underlying these trademark conflicts is the familiar tension between property rights and free speech rights. While some trademark scholars are reluctant to consider a trademark as property, we find strong support for the property paradigm in Hegel’s philosophy. Assuming that a trademark is an earned property right, we propose that a trademark owner should be allowed to control the permutations of its trademark incorporated into domain names unless a reasonable person would not confuse that domain name with the company’s mark. But we also conclude that there must be latitude to employ a domain name for negative editorial comment, so long as the source and purpose of that domain name are plainly apparent.

Word Segmentation of Output Response for Sign Language Devices

IIUM Engineering Journal ◽

10.31436/iiumej.v21i2.1408 ◽

2020 ◽

Vol 21 (2) ◽

pp. 153-163

Author(s):

Nor Farahidah Za'bah ◽

Ahmad Amierul Ashraf Muhammad Nazmi ◽

Amelia Wong Azman

Keyword(s):

Dynamic Programming ◽

Sign Language ◽

English Language ◽

Word Segmentation ◽

Input String ◽

Text Segmentation ◽

Segmentation Method ◽

Text Input ◽

Acceptable Accuracy ◽

Language Text

Segmentation is an important aspect of translating the finger spelling of sign language into the Latin alphabet. Although currently available sign language devices can translate finger spelling into letters, the output is stored as one long continuous string without spaces between words. The system proposed in this work is meant to be used together with a text-generating glove device. The system takes a text input string, fed in one character at a time, and segments it into semantically correct words. The proposed text segmentation method uses dynamic programming and a back-off algorithm, together with a probability score based on word matching against an English-language text corpus. Based on the results, the system is able to segment words properly with acceptable accuracy.
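
A dynamic-programming segmentation of an unspaced string can be sketched as follows: for every prefix, keep the split with the highest total log-probability. This is a sketch under assumptions, with made-up probabilities in place of the corpus; the single-character fallback stands in for the back-off step, which the abstract does not specify, and the name `segment` is hypothetical.

```python
import math


def segment(text, word_prob, max_word_len=20, unknown_prob=1e-10):
    """Split text into the word sequence with the highest total log-probability."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[k]: best score for text[:k]
    back = [0] * (n + 1)            # back[k]: start index of the last word in text[:k]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_prob:
                log_p = math.log(word_prob[word])
            elif len(word) == 1:
                # Assumed fallback: an unknown single character gets a tiny probability.
                log_p = math.log(unknown_prob)
            else:
                continue
            candidate = best[start] + log_p
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical probabilities; a real system would estimate them from a corpus.
PROBS = {"hello": 0.01, "world": 0.01, "hell": 0.005, "o": 0.001, "low": 0.002}

print(segment("helloworld", PROBS))
```

Because every single character is an allowed word, the table always has a valid path, so input containing out-of-vocabulary letters still produces a segmentation.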
