Introduction of scoring for author identification by text mining: Effects of the number of characters and texts, and the features of writing style

2017 ◽  
Vol 22 (2) ◽  
pp. 91-108
Author(s):  
Wataru Zaitsu ◽  
Mingzhe Jin


Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Authorship attribution is the task of identifying the writer of an unknown text by assigning it to a known writer. Each author's writing style is distinct and can be used for discrimination. Different parameters are responsible for rectifying changes in writing style. When the writing samples collected for an author come from a short period, they contribute efficiently to identifying an unknown sample. This paper considers the author identification problem where the writing samples are not available from the same time period; the evidence is instead collected over a long period of time. Character n-gram, word n-gram, and PoS n-gram features are used to build the model, as they capture a writer's style in terms of both content and the statistical characteristics of writing. We applied a support vector machine algorithm for classification. The experiments produced effective results. When discriminating among multiple authors, corpus selection and construction was the most tedious task, and it was carried out effectively. Accuracy was observed to vary with feature type: word and character n-grams showed better accuracy than PoS n-grams.
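To make the feature types named in this abstract concrete, the following is a minimal sketch of character and word n-gram extraction using only the Python standard library. The function names and the sample sentence are illustrative assumptions, not taken from the paper; PoS n-grams (which need a part-of-speech tagger) and the support vector machine step are not shown.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Normalize raw counts so texts of different lengths are comparable."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Hypothetical sample text, used only to show the call pattern.
sample = "the cat sat on the mat"
char_features = relative_frequencies(char_ngrams(sample, 3))
word_features = relative_frequencies(word_ngrams(sample, 2))
```

Relative frequencies like these could then be assembled into fixed-length vectors and passed to a classifier of choice.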


2019 ◽  
Vol 29 (1) ◽  
pp. 1408-1415
Author(s):  
Tareef Kamil Mustafa

Literary scripts can be compared to paintings, both artistically and in terms of financial value, as the value of these scripts rises and falls with their author's popularity. An author's scripts represent a specific style of writing that can be measured and compared using a text mining field called stylometry. Stylometric analysis depends on features called authorship attributes, which can be used in special algorithms and methods for that purpose. Generally, each method in the stylometric field uses a variety of attributes to reach higher prediction accuracy. The aim of this research is to improve the accuracy of authorship prediction in literary works based on the artistic writing style of the authors. To achieve that, a new set of attributes is used with the Stylometric Authorship Balanced Attribution method, which was chosen among several other machine learning methods because of its sensitivity in authorship prediction projects. The attributes used by most researchers were word frequencies (single words, pairs of words, or trios of words), which led to some prediction mistakes. In this research, a new set of attributes is used to decrease these mistakes. The proposed non-word attributes are sentence length, special characters, and punctuation symbols. The results obtained by using these proposed attributes were excellent.
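As a rough illustration of the three non-word attributes named above, the sketch below computes a mean sentence length, a punctuation rate, and a special-character rate for one text. The exact definitions are assumptions made for this example (a naive sentence splitter, a hand-picked punctuation set, and "special characters" taken to mean the remaining ASCII symbols); the abstract does not specify how the authors measured them.

```python
import re
import string

# Assumed split between "punctuation symbols" and "special characters".
PUNCTUATION = set(",.;:!?'\"-()")
SPECIAL = set(string.punctuation) - PUNCTUATION


def non_word_features(text: str) -> dict:
    """Compute simple non-word stylometric features for one text."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    n_chars = len(text) or 1  # avoid division by zero on empty input
    return {
        "mean_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "punctuation_rate": sum(ch in PUNCTUATION for ch in text) / n_chars,
        "special_char_rate": sum(ch in SPECIAL for ch in text) / n_chars,
    }


# Hypothetical sample text, used only to show the call pattern.
features = non_word_features("Hello, world! This costs $5 & more.")
```

Rates are divided by text length so that long and short texts stay comparable.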


2013 ◽  
Vol 1 (1) ◽  
pp. 35-43
Author(s):  
Sandaruwan Prabath Kumara Ranatunga

Authorship verification relies on determining whether a given document was written by a particular author. Its main concern is analyzing the document itself with respect to variations in the author's writing style and identifying the author's own idiolect. Detection performance depends mainly on the feature set used for clustering the documents. Linguistic and stylistic features have been used for author identification according to the writing style of a particular author. Disclosing shallow changes in the author's writing style is the major problem to be addressed in the domain of authorship verification. It motivates computer science researchers to study authorship verification in the field of computer forensics, and this research also focuses on this problem. The contributions of the research are twofold: the former is a new feature extraction method using Natural Language Processing (NLP), and the latter is a new, more efficient linguistic feature set for verifying the author of a given document. Experiments were run on a corpus of freely downloadable genuine 19th-century English books, and Self-Organizing Maps were used as the classifier to cluster the documents. Proper word segmentation is also introduced in this work, and it helps to demonstrate that the proposed strategy can produce promising results. Finally, the proposed strategy with the extracted linguistic feature set is found to generate more accurate classification.
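For readers unfamiliar with Self-Organizing Maps, here is a minimal sketch of a one-dimensional SOM in plain Python. It is a generic textbook-style illustration, not the configuration used in the paper: the map size, learning-rate schedule, neighbourhood function, and toy two-dimensional "document vectors" are all assumptions made for this example.

```python
import math


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    dists = [sum((w - v) ** 2 for w, v in zip(node, x)) for node in weights]
    return dists.index(min(dists))


def train_som(data, weights, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a one-dimensional SOM in place and return its weights.

    data    : list of feature vectors (one per document)
    weights : initial weight vectors, one per map node
    """
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                   # learning rate shrinks over time
        sigma = max(sigma0 * decay, 0.1)   # neighbourhood radius shrinks too
        for x in data:
            bmu = best_matching_unit(weights, x)
            for i, node in enumerate(weights):
                # Gaussian neighbourhood: nodes near the BMU move more.
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(len(node)):
                    node[d] += lr * h * (x[d] - node[d])
    return weights


# Toy feature vectors standing in for two stylistically different documents.
docs = [[0.0, 0.0], [1.0, 1.0]]
som = train_som(docs, [[0.25, 0.25], [0.5, 0.5], [0.75, 0.75]])
clusters = [best_matching_unit(som, d) for d in docs]
```

After training, each document is assigned to its best-matching node, so documents that land on the same or neighbouring nodes can be read as stylistically similar.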


Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Over the last few decades, there has been tremendous growth in online communication through different types of media. Communication via the Internet is anonymous, which causes a critical issue regarding identity tracing. Authorship identification can apply to tasks such as identifying an anonymous author, detecting plagiarism, or finding a ghostwriter. Previous research has outlined various methods, and improvements to them, for identifying anonymous authors based on stylometry. However, changes in an author's writing style over a long period have not been addressed. In this article, we propose a methodology for author identification where the writing style of an author changes. The proposed methodology consists of two phases: the first shows the change in the author's writing style, and the second mitigates that change with a new feature normalization technique. A novel Transform Feature to Current Time function is proposed for normalization, where features are shifted to the current time and made available for further classification. A machine-learning algorithm is used to identify an author candidate. Experiments with the proposed methodology were conducted on text samples by several authors collected over different time periods, and the results show an improvement in performance.
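The abstract does not define the Transform Feature to Current Time function, so the following is purely a hypothetical illustration of the general idea of shifting a feature to the current time. It assumes, for the sake of the example, that a feature drifts linearly with time and moves each older measurement along a least-squares trend line. The function names, the linear-trend assumption, and the numbers are all invented here and should not be read as the authors' method.

```python
def fit_slope(times, values):
    """Least-squares slope of values against times (0.0 if times do not vary)."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    var = sum((t - t_mean) ** 2 for t in times)
    if var == 0:
        return 0.0
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    return cov / var


def shift_to_current_time(times, values, current_time):
    """Move each feature value along the fitted linear trend to current_time."""
    slope = fit_slope(times, values)
    return [v + slope * (current_time - t) for t, v in zip(times, values)]


# Illustrative, made-up measurements of one feature for one author.
years = [2000, 2001, 2002]
mean_sentence_length = [18.0, 19.0, 20.0]
shifted = shift_to_current_time(years, mean_sentence_length, current_time=2002)
```

Under this assumed linear drift, samples written in different years become directly comparable before classification.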


1970 ◽  
Vol 3 (1) ◽  
pp. 11
Author(s):  
Dianne Mei Cheong Lee ◽  
Nur Atiqah Sia Abdullah

E-mail can be a fantasy playground for identity experimentation, where players take on an imaginary persona and interact with each other in the virtual world. Therefore, gender deception is difficult, risky and it can be abandoned at will. Inferences can be made both from writing style and from clues hidden in the posting data. A text-mining algorithm was designed to detect gender deception based on gender-preferential features at the word or clause level among Malaysian e-mail users. Based on this algorithm, a prototype was developed in Visual Basic. It was tested with 16 documents, each consisting of five e-mail exchanges of the respective individuals. The tests showed that the prototype has an accuracy of 81.3%. This prototype can be a tool to assist interested parties such as the Criminology and Forensic Department, e-mail users, and virtual communities to successfully identify gender deception.


1976 ◽  
Vol 41 (4) ◽  
pp. 523-529
Author(s):  
Daniel R. Boone ◽  
Harold M. Friedman

Reading and writing performance was observed in 30 adult aphasic patients to determine whether there was a significant difference when stimuli and manual responses were varied in the written form: cursive versus manuscript. Patients were asked to read aloud 10 words written cursively and 10 words written in manuscript form. They were then asked to write on dictation 10 word responses using cursive writing and 10 words using manuscript writing. Number of words correctly read, number of words correctly written, and number of letters correctly written in the proper sequence were tallied for both cursive and manuscript writing tasks for each patient. Results indicated no significant difference in correct response between cursive and manuscript writing style for these aphasic patients as a group; however, it was noted that individual patients varied widely in their success using one writing form over the other. It appeared that since neither writing form showed better facilitation of performance, the writing style used should be determined according to the individual patient’s own preference and best performance.

