Introduction of scoring for author identification by text mining: Effects of the number of characters and texts, and the features of writing style

Authorship attribution is the task of identifying the writer of an unknown text by assigning it to one of a set of known writers. Each author's writing style is distinct and can be used for discrimination. Style can nevertheless change over time, and several parameters can be used to account for such changes. When the writing samples collected for an author belong to a short period, they can contribute effectively to identifying an unknown sample. This paper considers the author identification problem where writing samples are not available from the same time period; the evidence is collected over a long period of time. Character n-gram, word n-gram, and part-of-speech (PoS) n-gram features are used to build the model, as they capture a writer's style in terms of both content and the statistical characteristics of the writing. We applied a support vector machine algorithm for classification. The experiments produced effective results. For discriminating among multiple authors, corpus selection and construction were the most tedious tasks, and they were carried out effectively. Accuracy was observed to vary with feature type: word and character n-grams showed better accuracy than PoS n-grams.
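As a rough illustration of the character and word n-gram features mentioned in this abstract, the following minimal Python sketch counts overlapping n-grams and turns the counts into relative frequencies. The function names, the default values of n, and the whitespace tokenization are assumptions made for illustration; the paper's actual feature extraction and its support vector machine setup are not described here and are not reproduced.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams using simple whitespace tokenization."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def relative_frequencies(counts):
    """Convert raw n-gram counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

Frequency profiles of this kind could then be turned into fixed-length vectors and passed to a classifier; how the paper does this is not stated in the abstract.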


Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style

Expert Systems with Applications ◽

10.1016/j.eswa.2012.12.082 ◽

2013 ◽

Vol 40 (9) ◽

pp. 3756-3763 ◽

Cited By ~ 42

Author(s):

Gabriel Oberreuter ◽

Juan D. Velásquez

Keyword(s):

Text Mining ◽

Writing Style ◽

Plagiarism Detection


Non-word Attributes’ Efficiency in Text Mining Authorship Prediction

Journal of Intelligent Systems ◽

10.1515/jisys-2019-0068 ◽

2019 ◽

Vol 29 (1) ◽

pp. 1408-1415

Author(s):

Tareef Kamil Mustafa

Keyword(s):

Text Mining ◽

Word Pair ◽

Prediction Accuracy ◽

Sentence Length ◽

Machine Language ◽

Writing Style ◽

Single Word ◽

Word Attributes ◽

Style Of Writing ◽

Word Frequencies

Literary scripts can be compared to paintings, both artistically and in terms of financial value, since the value of these scripts rises and falls with their author’s popularity. An author’s scripts represent a specific style of writing that can be measured and compared using a text mining field called stylometry. Stylometric analysis depends on features called authorship attributes, which can be used in specialized algorithms and methods for that purpose. Generally, each method in the stylometric field uses a variety of attributes to reach higher prediction accuracy. The aim of this research is to improve the accuracy of authorship prediction in literary works based on the artistic writing style of the authors. To achieve that, a new set of attributes is used with the Stylometric Authorship Balanced Attribution method, which was chosen in this research among several other machine language methods because of its delicateness in authorship prediction projects. The attributes used by most researchers were word frequencies (single words, word pairs, or word trios), which led to some prediction mistakes. In this research, a new set of attributes is used to reduce these mistakes. The proposed non-word attributes are sentence length, special characters, and punctuation symbols. The results obtained with these proposed attributes were excellent.
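The three non-word attributes named in this abstract (sentence length, special characters, and punctuation symbols) could be computed along the following lines. This is only a sketch under stated assumptions: the function name is hypothetical, sentences are split naively on ".", "!" and "?", punctuation means the ASCII set in Python's `string.punctuation`, and "special characters" is taken to mean any other non-alphanumeric, non-whitespace character. The paper's own definitions may differ.

```python
import re
import string


def non_word_features(text):
    """Compute simple non-word style features (hypothetical helper)."""
    # Naive sentence split on runs of ., ! or ? (an assumption, not the paper's rule).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]

    # ASCII punctuation symbols as defined by string.punctuation.
    punctuation = sum(1 for ch in text if ch in string.punctuation)

    # Assumed definition of "special": not alphanumeric, not whitespace,
    # and not already counted as ASCII punctuation.
    special = sum(
        1
        for ch in text
        if not ch.isalnum() and not ch.isspace() and ch not in string.punctuation
    )

    return {
        "mean_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "punctuation_count": punctuation,
        "special_char_count": special,
    }
```

For example, `non_word_features("One two. Three four!")` reports a mean sentence length of 2.0 words and two punctuation symbols.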


Finding Efficient Linguistic Feature Set for Authorship Verification

Journal of Computer Science ◽

10.31357/jcs.v1i1.1616 ◽

2013 ◽

Vol 1 (1) ◽

pp. 35-43

Author(s):

Sandaruwan Prabath Kumara Ranatunga

Keyword(s):

Language Processing ◽

Writing Style ◽

Linguistic Features ◽

Self Organizing Maps ◽

Linguistic Feature ◽

Feature Extracting ◽

Author Identification ◽

Authorship Verification ◽

New Feature ◽

The Given

Authorship verification relies on identifying whether or not a given document was written by a particular author. Its main concern is analyzing the document itself with respect to variations in the author’s writing style and identifying the author’s own idiolect. Detection performance depends mainly on the feature set used for clustering the documents. Linguistic and stylistic features have been utilized for author identification according to the writing style of a particular author. Revealing shallow changes in the author’s writing style is the major problem to be addressed in the domain of authorship verification. It motivates computer science researchers to study authorship verification in the field of computer forensics, and this research also focuses on this problem. The contributions of the research are twofold: the first is a new feature extraction method based on Natural Language Processing (NLP), and the second is a new, more efficient linguistic feature set for verifying the author of a given document. Experiments were conducted on a corpus of freely downloadable genuine 19th-century English books, and Self-Organizing Maps were used as the classifier to cluster the documents. Proper word segmentation is also introduced in this work, and it helps demonstrate that the proposed strategy can produce promising results. Finally, the proposed strategy with the extracted linguistic feature set is found to generate more accurate classification.


Author identification with feature transformation method

Digital Scholarship in the Humanities ◽

10.1093/llc/fqz052 ◽

2019 ◽

Author(s):

Mubin Shoukat Tamboli ◽

Rajesh Prasad

Keyword(s):

Learning Algorithm ◽

Online Communication ◽

Critical Issue ◽

Transformation Method ◽

Writing Style ◽

Current Time ◽

Author Identification ◽

New Feature ◽

Two Phases ◽

Authorship Identification

Over the last few decades, there has been tremendous growth in online communication through different types of media. Communication via the Internet is anonymous, which causes a critical issue regarding identity tracing. Authorship identification can apply to tasks such as identifying an anonymous author, detecting plagiarism, or finding a ghostwriter. Previous research has outlined the various methods and their improvements for the identification of anonymous authors based on stylometry. However, changes in the writing style of an author over a long period have not been addressed. In this article, we propose a methodology for author identification where the writing style of an author changes. The proposed methodology consists of two phases: the first shows the change in the author's writing style, and the second mitigates that change with a new feature normalization technique. A novel Transform Feature to Current Time function is proposed for normalization, where features are shifted to the current time and made available for further classification. A machine-learning algorithm is used to identify an author candidate. Experiments with the proposed methodology were conducted on a set of text samples by several authors collected over different time periods, and the results show an improvement in performance.


Using Text Mining Algorithm to Detect Gender Deception based on Malaysian Chat Room Lingo

Social and Management Research Journal ◽

10.24191/smrj.v3i1.5097 ◽

1970 ◽

Vol 3 (1) ◽

pp. 11

Author(s):

Dianne Mei Cheong Lee ◽

Nur Atiqah Sia Abdullah

Keyword(s):

Text Mining ◽

Virtual World ◽

Virtual Communities ◽

Visual Basic ◽

Chat Room ◽

Writing Style ◽

Mining Algorithm ◽

Accuracy Level ◽

E Mail ◽

At Will

E-mail can be a fantasy playground for identity experimentation, where players take on an imaginary persona and interact with each other in the virtual world. Therefore, gender deception is difficult and risky, and it can be abandoned at will. Inferences can be made both from writing style and from clues hidden in the posting data. A text-mining algorithm was designed to detect gender deception based on gender-preferential features at the word or clause level among Malaysian e-mail users. Based on this algorithm, a prototype was developed in Visual Basic. It was tested with 16 documents, each consisting of five e-mail exchanges by the respective individuals. The tests showed that the prototype has an accuracy level of 81.3%. This prototype can be a tool to assist interested parties such as the Criminology and Forensic Department, e-mail users, and virtual communities in successfully identifying gender deception.


Accuracy and Standardized Judgment Procedures for Author Identification by Text Mining

Kodo Keiryogaku (The Japanese Journal of Behaviormetrics) ◽

10.2333/jbhmk.45.39 ◽

2018 ◽

Vol 45 (1) ◽

pp. 39-47

Author(s):

Wataru Zaitsu ◽

Mingzhe Jin

Keyword(s):

Text Mining ◽

Author Identification


Documenting Clinical Service Delivery: Writing Style and Lexical Selection

Contemporary Issues in Communication Science and Disorders ◽

10.1044/cicsd_27_s_6 ◽

2000 ◽

Vol 27 (Spring) ◽

pp. 6-13

Author(s):

Dorian Lee Wilkerson

Keyword(s):

Service Delivery ◽

Clinical Service ◽

Lexical Selection ◽

Writing Style


Writing in Aphasia Rehabilitation: Cursive vs Manuscript

Journal of Speech and Hearing Disorders ◽

10.1044/jshd.4104.523 ◽

1976 ◽

Vol 41 (4) ◽

pp. 523-529 ◽

Cited By ~ 3

Author(s):

Daniel R. Boone ◽

Harold M. Friedman

Keyword(s):

Correct Response ◽

The Other ◽

Read Aloud ◽

Writing Style ◽

Reading And Writing ◽

Manual Responses ◽

Significant Difference ◽

Written Form ◽

The Individual ◽

Read Number

Reading and writing performance was observed in 30 adult aphasic patients to determine whether there was a significant difference when stimuli and manual responses were varied in the written form: cursive versus manuscript. Patients were asked to read aloud 10 words written cursively and 10 words written in manuscript form. They were then asked to write on dictation 10 word responses using cursive writing and 10 words using manuscript writing. Number of words correctly read, number of words correctly written, and number of letters correctly written in the proper sequence were tallied for both cursive and manuscript writing tasks for each patient. Results indicated no significant difference in correct response between cursive and manuscript writing style for these aphasic patients as a group; however, it was noted that individual patients varied widely in their success using one writing form over the other. It appeared that since neither writing form showed better facilitation of performance, the writing style used should be determined according to the individual patient’s own preference and best performance.
