Video question answering with large language models: a survey

Junlin Xie¹, Ruifei Zhang¹, Guanbin Li²,*

Journal of Image and Graphics | 2025, 30(12) : 3760 - 3781

• Review •

Affiliations

¹School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen 518116, China

²School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China

Published: 2025-12-16 doi: 10.11834/jig.240535

Abstract

In recent years, large language models (LLMs) have achieved remarkable progress in natural language processing (NLP), demonstrating exceptional capabilities in language understanding and generation. These advances have driven widespread applications in tasks such as text generation, machine translation, question answering, text summarization, and text classification. However, despite their impressive performance in processing and generating text, LLMs face notable limitations on highly complex multimodal tasks, particularly video question answering (Video QA). Video QA is a particularly challenging task that requires models to comprehend dynamic visual content, which often includes temporal and auditory information, and to generate responses based on it. Unlike static images or purely textual content, video data contains inherent temporal dependencies: the meaning of events and actions unfolds over time. This temporal dimension adds substantial complexity because models must not only interpret individual frames but also maintain a coherent understanding across sequences of frames within the broader video context. Consequently, effective Video QA demands advanced temporal information processing capabilities that many LLMs, designed primarily for static text, struggle to provide. Moreover, the multimodal nature of video, which often involves visual, auditory, and occasionally textual cues, further complicates the task. Effective Video QA requires the model to fuse information across these modalities so that it can accurately interpret and answer questions about video content. This process involves understanding visual scenes, recognizing speech or background sounds, and correlating them with the corresponding textual information.
The challenge lies not only in processing each modality independently but also in establishing meaningful connections between them to generate coherent and contextually appropriate responses. This paper presents a comprehensive review of the current state of research on Video QA models based on large language models. It also investigates the technical characteristics, strengths, and weaknesses of non-real-time and real-time Video QA models. Non-real-time Video QA models typically operate on pre-recorded video content, which allows them to access and analyze the entire video sequence before generating responses. Because they can leverage global contextual information, they are particularly effective for tasks that require analysis of the whole video, such as video summarization or detailed scene interpretation. However, they may struggle with efficiency and scalability, particularly when handling long videos or large datasets. In contrast, real-time Video QA models are designed to process video streams as they are received, which makes them suitable for applications requiring immediate responses, such as live video monitoring or interactive video systems. However, because these models often have limited access to the full temporal context of the video, they must balance processing speed against accuracy. The paper discusses the challenges these models encounter in maintaining performance under real-time constraints, including efficient computation and prediction from partial information. Additionally, the paper explores the datasets commonly used in Video QA research, highlighting their features, limitations, and the types of tasks they are designed to address. The evaluation of Video QA models is also examined, with a focus on the metrics and benchmarks used to assess their performance.
Understanding the strengths and weaknesses of different datasets is crucial for advancing the field: it helps identify gaps in current research and guides the development of robust and versatile models. Finally, the paper addresses the major challenges and bottlenecks in Video QA, including the difficulty of scaling models to large and diverse video datasets, the need for efficient multimodal fusion techniques, and the computational demands of processing video data in real time. The discussion then turns to potential future research directions in Video QA, with particular emphasis on improving the temporal reasoning capabilities of LLMs, enhancing their multimodal integration, and developing efficient model architectures that can operate effectively under resource constraints. Overall, while large language models have opened new possibilities in video interpretation, considerable challenges remain in adapting these models to the specific demands of Video QA. By systematically reviewing current advances and presenting the key obstacles and future directions, this paper aims to contribute to ongoing efforts to develop highly capable and intelligent multimodal AI systems. The field needs continued innovation in two areas: temporal modeling, where novel architectures that can effectively capture long-range dependencies in video sequences are needed; and multimodal representation learning, where sophisticated approaches for integrating visual, auditory, and textual features could yield substantial improvements. Furthermore, developing highly efficient training paradigms that address the computational intensity of video processing while retaining model performance is essential for practical applications.
Another critical area for future work is the creation of comprehensive and challenging benchmark datasets that reflect real-world scenarios and push the boundaries of what current models can achieve. As research in this area progresses, addressing these challenges will be crucial for realizing the full potential of LLMs in video interpretation applications. Achieving this goal will require AI systems that can interpret and reason about dynamic visual content with a level of proficiency comparable to human cognition. The integration of advanced techniques from computer vision, speech processing, and natural language understanding will be pivotal in developing truly multimodal systems capable of managing the complexity and variability of real-world video data. Through continued innovation and interdisciplinary collaboration, the field can overcome current limitations and drive the development of next-generation video understanding technologies with broad applicability across domains such as education, entertainment, surveillance, and human-computer interaction.

Key words

large language models (LLMs) / video question answering (Video QA) / multimodal information fusion / temporal information processing / video understanding

Cite this Article

Junlin Xie, Ruifei Zhang, Guanbin Li. Video question answering with large language models: a survey[J]. Journal of Image and Graphics, 2025, 30(12): 3760-3781. DOI: 10.11834/jig.240535

Article Info

Received: 2024-09-06
Revised: 2025-04-20
Published: 2025-12-16
Online: 2026-04-09

https://castjournals.cast.org.cn/joweb/zgtxtxxb/EN/10.11834/jig.240535
