Sarwar, Raheem, Perera, Maneesha, Teh, Pin Shen, NAWAZ, Raheel and Hassan, Muhammad Umair (2024) Crossing Linguistic Barriers: Authorship Attribution in Sinhala Texts. ACM Transactions on Asian and Low-Resource Language Information Processing, 23 (5). pp. 1-14. ISSN 2375-4699
3655620.pdf - Publisher's typeset copy
Available under License: Creative Commons Attribution 4.0 International (CC BY 4.0).
Abstract or description
Authorship attribution is the task of determining the original author of an anonymous text from a pool of candidate authors. It has applications in several domains, such as plagiarism detection, digital text forensics, and information retrieval. While these applications extend beyond any single language, existing research has predominantly focused on English, which makes it difficult to apply to languages such as Sinhala because of linguistic differences and a lack of language processing tools. We present the first comprehensive study on cross-topic authorship attribution for Sinhala texts and propose a solution that can effectively perform authorship attribution even when the topics of the test and training samples differ. Our solution consists of three main parts: (i) extraction of topic-independent stylometric features, (ii) generation of a small candidate author set with the help of similarity search, and (iii) identification of the true author. Several experimental studies were carried out to demonstrate that the proposed solution can effectively handle real-world scenarios involving a large number of candidate authors and a limited number of text samples for each candidate author. © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
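The three-part pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature set (mean word length, mean sentence length, punctuation frequencies), the use of cosine similarity for the similarity search, and the mean-similarity decision rule are all assumptions chosen for brevity, and the function names `stylometric_features`, `candidate_authors`, and `attribute` are hypothetical.

```python
import math

# Assumed, illustrative set of punctuation marks; the paper's actual
# stylometric features are not listed in the abstract.
PUNCTUATION = ".,;:!?\"'()-"


def stylometric_features(text):
    """Part (i): map a text to a small, roughly topic-independent vector."""
    words = text.split()
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    # Crude sentence split on '.', '!' and '?'.
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    n_sents = max(len(sentences), 1)
    features = [
        sum(len(w) for w in words) / n_words,  # mean word length
        n_words / n_sents,                     # mean sentence length in words
    ]
    # Relative frequency of each punctuation mark.
    features.extend(text.count(p) / n_chars for p in PUNCTUATION)
    return features


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_authors(query_vec, samples, k):
    """Part (ii): authors of the most similar samples, at most k distinct authors.

    `samples` is a list of (author, feature_vector) pairs.
    """
    ranked = sorted(samples, key=lambda s: cosine(query_vec, s[1]), reverse=True)
    candidates = []
    for author, _ in ranked:
        if author not in candidates:
            candidates.append(author)
        if len(candidates) == k:
            break
    return candidates


def attribute(text, training, k=3):
    """Part (iii): pick the candidate with the highest mean similarity.

    `training` maps each author name to a list of that author's texts.
    """
    query_vec = stylometric_features(text)
    samples = [
        (author, stylometric_features(t))
        for author, texts in training.items()
        for t in texts
    ]
    scores = {}
    for author in candidate_authors(query_vec, samples, k):
        sims = [cosine(query_vec, vec) for a, vec in samples if a == author]
        scores[author] = sum(sims) / len(sims)
    return max(scores, key=scores.get)
```

Calling `attribute(unknown_text, {"author_a": [...], "author_b": [...]}, k=2)` returns one of the author names in the training dictionary. Narrowing to `k` candidates before the final decision mirrors the abstract's motivation of handling many candidate authors with few samples each; a real system would likely replace the linear scan with an indexed similarity search and use a richer, language-appropriate feature set.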
| Field | Value |
|---|---|
| Item Type | Article |
| Faculty | Executive |
| Depositing User | Raheel NAWAZ |
| Date Deposited | 11 Sep 2024 15:28 |
| Last Modified | 11 Sep 2024 16:01 |
| URI | https://eprints.staffs.ac.uk/id/eprint/8438 |