STORE - University of Staffordshire Online Repository

A transformer-based Urdu image caption generation

Hadi, Muhammad, Safder, Iqra, Waheed, Hajra, Zaman, Farooq, Aljohani, Naif Radi, Nawaz, Raheel, Hassan, Saeed Ul and Sarwar, Raheem (2024) A transformer-based Urdu image caption generation. Journal of Ambient Intelligence and Humanized Computing, 15 (9). pp. 3441-3457. ISSN 1868-5137

Full text: s12652-024-04824-9.pdf (publisher's typeset copy, 3MB). Available under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.
Official URL: http://dx.doi.org/10.1007/s12652-024-04824-9

Abstract

Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly for low-resource languages such as Urdu. The limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach leverages transformer models to generate captions in Urdu, a significantly more challenging task than captioning in English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the Flickr8k dataset, containing images of dogs in action accompanied by corresponding Urdu captions. Our deep-learning-based approach uses three different architectures: a Convolutional Neural Network (CNN) + Long Short-Term Memory (LSTM) model with soft attention and Word2Vec embeddings, a CNN+Transformer model, and a ViT+RoBERTa model. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving a BLEU-1 score of 86 and a BERT-F1 score of 90. The generated Urdu image captions are syntactically, contextually, and semantically correct. Our study also underscores the inherent challenges of retraining models on low-resource languages, and our findings highlight the potential of pre-trained models for developing NLP and CV applications in low-resource settings. © The Author(s) 2024.
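The third architecture named in the abstract, ViT+RoBERTa, pairs a Vision Transformer encoder with a RoBERTa decoder. As a minimal sketch of that pairing, the snippet below uses Hugging Face's VisionEncoderDecoderModel with public checkpoints; the checkpoint names, image path, and caption string are illustrative assumptions, and the paper's actual Urdu-adapted weights, tokenizer, and training data are not reproduced here.

```python
# Minimal sketch of a ViT encoder + RoBERTa decoder captioner, assuming
# Hugging Face Transformers. Checkpoints, file names, and the caption
# text are illustrative placeholders, not the paper's artefacts.
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Pair the encoder and decoder; cross-attention layers are added and
# randomly initialised, so the combined model must be fine-tuned.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed encoder checkpoint
    "roberta-base",                       # assumed decoder checkpoint
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Generation settings required for an encoder-decoder captioner.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# One supervised step on a single (image, caption) pair.
image = Image.open("example.jpg").convert("RGB")  # hypothetical image file
pixel_values = processor(images=image, return_tensors="pt").pixel_values
labels = tokenizer("a dog runs across the grass",  # placeholder caption
                   return_tensors="pt").input_ids
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()

# After fine-tuning, captions are generated autoregressively.
with torch.no_grad():
    ids = model.generate(pixel_values, max_length=32, num_beams=4)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

The abstract also reports evaluation by BLEU-1 and BERT-F1. As a hedged illustration, BLEU-1 (unigram precision with a brevity penalty) can be computed with NLTK as shown below; the Urdu token lists are invented placeholders rather than items from the paper's dataset.

```python
# BLEU-1 on a single hypothetical caption pair, assuming NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["ایک", "کتا", "گھاس", "پر", "دوڑ", "رہا", "ہے"]   # "a dog is running on the grass"
candidate = ["ایک", "کتا", "گھاس", "پر", "بھاگ", "رہا", "ہے"]  # model output (placeholder)

bleu1 = sentence_bleu(
    [reference], candidate,
    weights=(1.0, 0, 0, 0),  # unigram weight only -> BLEU-1
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-1: {bleu1:.2f}")
```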

Item Type: Article
Uncontrolled Keywords: Caption generation; Deep learning; Natural Language Processing; NLP; Transformers; Urdu image caption generation
Faculty: Executive
Depositing User: Raheel Nawaz
Date Deposited: 11 Sep 2024 15:30
Last Modified: 11 Sep 2024 15:55
URI: https://eprints.staffs.ac.uk/id/eprint/8436
