
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens

1Zhejiang University  2ByteDance Inc.

Abstract

Recent advances in large video-language models have displayed promising outcomes in video comprehension. Current approaches straightforwardly convert video into language tokens and employ large language models for multi-modal tasks. However, this method often leads to the generation of irrelevant content, commonly known as "hallucination", as the text grows longer and the influence of the video diminishes. To address this problem, we propose Vista-LLaMA, a novel framework that maintains a consistent distance between all visual tokens and any language token, irrespective of the length of the generated text. Vista-LLaMA omits relative position encoding when computing attention weights between visual and text tokens, while retaining it between text tokens. This amplifies the effect of visual tokens on text generation, especially when the relative distance between visual and text tokens grows. The proposed attention mechanism significantly reduces the chance of producing text irrelevant to the video content. Furthermore, we present a sequential visual projector that projects the current video frame into tokens of the language space with the assistance of the previous frame. This approach not only captures the temporal relationships within the video, but also allows fewer visual tokens to cover the entire video. Our approach significantly outperforms various previous methods (e.g., Video-ChatGPT, MovieChat) on four challenging open-ended video question answering benchmarks. We reach an accuracy of 60.7 on zero-shot NExT-QA and 60.5 on zero-shot MSRVTT-QA, setting a new state-of-the-art performance.
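As a rough illustration of the sequential visual projector, here is a minimal sketch assuming a simple recurrent formulation; the module name, the mean-pooling of patches, and the fusion layer are our own placeholders, not the paper's exact architecture.

import torch
import torch.nn as nn

class SequentialVisualProjector(nn.Module):
    """Sketch: map each frame to a few language-space tokens, conditioned on the previous frame."""
    def __init__(self, vis_dim, lang_dim, n_tokens_per_frame=4):
        super().__init__()
        self.n_tokens = n_tokens_per_frame
        self.proj = nn.Linear(vis_dim, lang_dim)       # frame features -> language space
        self.fuse = nn.Linear(2 * lang_dim, lang_dim)  # mix with the previous frame's tokens

    def forward(self, frame_feats):
        # frame_feats: [n_frames, n_patches, vis_dim] from a frozen visual encoder
        # (n_patches assumed divisible by n_tokens_per_frame).
        prev, outputs = None, []
        for feats in frame_feats:
            # Pool patches into a small, fixed number of tokens per frame.
            tokens = feats.reshape(self.n_tokens, -1, feats.shape[-1]).mean(dim=1)
            tokens = self.proj(tokens)                                  # [n_tokens, lang_dim]
            if prev is not None:
                # Carry temporal context forward from the previous frame's tokens.
                tokens = self.fuse(torch.cat([tokens, prev], dim=-1))
            outputs.append(tokens)
            prev = tokens
        return torch.cat(outputs, dim=0)               # [n_frames * n_tokens, lang_dim]

# Example: 8 frames with 256 patch features each, projected into a 4096-d language space.
projector = SequentialVisualProjector(vis_dim=1024, lang_dim=4096)
video_tokens = projector(torch.randn(8, 256, 1024))    # [32, 4096]

In this sketch, each frame reuses the previous frame's projected tokens, so a handful of tokens per frame can summarize the whole clip.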

Our Framework



The framework of Vista-LLaMA. The visual encoder and the large language model are both frozen during training, while the trainable projector maps the video into the language space. The attention operation in each layer is shown on the right. Rotary position embedding is applied only to the text tokens to encode relative distance; the attention weights between visual and language tokens are computed without it. The causal mask is applied to the bottom-right (text-to-text) attention weights.
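A minimal sketch of the attention described above (our own illustration, not the released implementation; apply_rope is a simplified rotary-embedding helper, and the single-head, unbatched shapes are chosen for clarity): rotary position embedding is applied only to the text-to-text block of the attention weights, while any weight involving a visual token is computed from the unrotated queries and keys, so every text token keeps an equal distance to the visual tokens.

import math
import torch

def apply_rope(x, positions):
    # Simplified rotary position embedding over channel pairs (rotate-half variant).
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=x.dtype) / half))
    angles = positions[:, None].to(x.dtype) * freqs[None, :]      # [seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def edvt_attention(q, k, v, n_vis):
    # q, k, v: [seq, dim]; the first n_vis positions are visual tokens, the rest text.
    seq, dim = q.shape
    pos = torch.arange(seq)
    q_rot, k_rot = apply_rope(q, pos), apply_rope(k, pos)

    scores_rope = q_rot @ k_rot.T / math.sqrt(dim)   # position-aware (text-to-text)
    scores_flat = q @ k.T / math.sqrt(dim)           # position-free (visual tokens involved)
    scores = scores_flat.clone()
    scores[n_vis:, n_vis:] = scores_rope[n_vis:, n_vis:]

    # Standard causal mask; since the visual tokens precede the text, it effectively
    # acts on the bottom-right (text-to-text) block, as in the figure.
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

With this split, the visual-to-text weights no longer decay as the generated text moves farther from the video tokens, which is compared against vanilla attention in the figure below.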

Comparison with SOTA

Method              NExT-QA        MSVD-QA        MSRVTT-QA      ActivityNet-QA
                    Acc.   Score   Acc.   Score   Acc.   Score   Acc.   Score
FrozenBiLM          -      -       32.2   -       16.8   -       24.7   -
Video Chat          56.2   3.2     56.3   2.8     45.0   2.5     26.5   2.2
LLaMA Adapter       -      -       54.9   3.1     43.8   2.7     34.2   2.7
Video LLaMA         -      -       51.6   2.5     29.6   1.8     12.4   1.1
MovieChat           49.9   2.7     61.0   2.9     49.7   2.8     51.5   3.1
Video-ChatGPT       54.6   3.2     64.9   3.3     49.3   2.8     35.2   2.7
Vista-LLaMA (Ours)  60.7   3.4     65.3   3.6     60.5   3.3     48.3   3.3


Comparison with SoTA methods on zero-shot VideoQA.

Figure panels: Vanilla attention; EDVT-Attention (Ours).


Comparison of attention weights for varying context lengths in different layers. Lighter colors represent higher weights. To improve clarity, we have aggregated the visual token weights into the first four tokens. We recommend zooming in for optimal viewing.

Our Results



Visualization results on different video questions. The questions and annotated answers are located on the left side. The generated text from Video-ChatGPT and our model is presented in the green and orange boxes, respectively. Read our paper for more details.


CineClipQA Dataset



We propose the CineClipQA dataset for movie QA: a novel dataset meticulously crafted to probe the capabilities of visual language models in comprehending and interpreting plot-driven video content.

BibTeX

@misc{ma2023vistallama,
      title={Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens}, 
      author={Fan Ma and Xiaojie Jin and Heng Wang and Yuchen Xian and Jiashi Feng and Yi Yang},
      year={2023},
      eprint={2312.08870},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}