Rajesh P. Chinchewadi

Work place: Manipur International University, Imphal, Manipur, India

E-mail: rajesh.cto@miu.edu.in

Biography

Rajesh P. Chinchewadi is CTO & Dean of Innovation at Manipur International University, Imphal, Manipur, India. His research focuses on the development of computational methods for scalable and responsible discovery science. He has published a number of research papers in international and national journals.

Author Articles
Optimized Image Captioning: Hybrid Transformers Vision Transformers and Convolutional Neural Networks: Enhanced with Beam Search

By Sushma Jaiswal, Harikumar Pallthadka, Rajesh P. Chinchewadi, Tarun Jaiswal

DOI: https://doi.org/10.5815/ijisa.2024.02.05, Pub. Date: 8 Apr. 2024

Deep learning has improved image captioning. The Transformer, a neural network architecture originally built for natural language processing, excels at image captioning and other computer vision applications. This paper reviews Transformer-based image captioning methods in detail. In traditional image captioning, convolutional neural networks (CNNs) extracted image features and RNNs or LSTM networks generated captions. This approach often suffers from information bottlenecks and has trouble capturing long-range dependencies. The Transformer architecture revolutionized natural language processing with its attention mechanism and parallel processing. Researchers have leveraged the Transformer's success in language to solve image captioning problems. Transformer-based image captioning systems outperform previous methods in accuracy and efficiency by integrating visual and textual information into a single model. This paper discusses how the Transformer architecture's self-attention mechanisms and positional encodings are adapted for image captioning. Vision Transformers (ViTs) and CNN-Transformer hybrid models are discussed, along with pre-training, fine-tuning, and reinforcement learning strategies for improving caption quality. Difficulties, trends, and future directions for Transformer-based image captioning are also examined; challenges include multimodal fusion, visual-text alignment, and caption interpretability. We expect research to address these issues and to apply Transformer-based image captioning to medical imaging and remote sensing. This paper covers how Transformer-based approaches have changed image captioning and their potential to revolutionize multimodal interpretation and generation, advancing artificial intelligence and human-computer interaction.

[...] Read more.
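
To make the idea of a CNN-Transformer hybrid captioner with beam search concrete, the sketch below shows one minimal way such a pipeline can be wired together in PyTorch. This is not the authors' implementation; the model sizes, vocabulary size, token ids, and the `CNNTransformerCaptioner` and `beam_search` names are illustrative assumptions.

```python
# Minimal illustrative sketch (not the paper's code): CNN encoder + Transformer decoder
# for image captioning, with a simple beam search over next-token probabilities.
import torch
import torch.nn as nn
import torchvision.models as models

VOCAB_SIZE, D_MODEL, MAX_LEN = 10_000, 512, 20   # assumed sizes
BOS, EOS = 1, 2                                  # assumed special-token ids

class CNNTransformerCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN backbone yields a grid of region features; the decoder attends over all
        # regions instead of compressing the image into one RNN state.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # (B, 2048, 7, 7)
        self.proj = nn.Linear(2048, D_MODEL)
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos = nn.Parameter(torch.zeros(MAX_LEN, D_MODEL))      # learned positional encoding
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def encode(self, images):                      # images: (B, 3, 224, 224)
        feats = self.cnn(images)                   # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)   # (B, 49, 2048): sequence of regions
        return self.proj(feats)                    # (B, 49, D_MODEL)

    def decode_step(self, tokens, memory):         # tokens: (B, T) partial captions
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos[:T]
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.decoder(x, memory, tgt_mask=mask)
        return self.out(h[:, -1])                  # logits for the next token

def beam_search(model, image, beam_size=3):
    """Keep the beam_size highest-scoring partial captions at every decoding step."""
    memory = model.encode(image.unsqueeze(0))
    beams = [(0.0, [BOS])]                         # (log-probability, token sequence)
    for _ in range(MAX_LEN - 1):
        candidates = []
        for score, seq in beams:
            if seq[-1] == EOS:                     # finished captions carry over unchanged
                candidates.append((score, seq))
                continue
            logits = model.decode_step(torch.tensor([seq]), memory)
            logp = torch.log_softmax(logits, dim=-1).squeeze(0)
            topv, topi = logp.topk(beam_size)
            for v, i in zip(topv.tolist(), topi.tolist()):
                candidates.append((score + v, seq + [i]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]                             # best-scoring token sequence
```

A greedy decoder would correspond to `beam_size=1`; widening the beam trades extra decoding passes for captions with higher overall sequence probability, which is the role beam search plays in the article's title.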
Other Articles