
ViCrop: Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal Large Language Models

By Jiarui Zhang and 3 other authors

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA), a fundamental task affecting various downstream applications and domains. Given the broad potential uses of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether MLLMs can perceive small details as well as larger components in images. In particular, we show that their zero-shot VQA accuracy is very sensitive to the size of the visual subject of the question, declining by up to $45.91\%$ as the subject shrinks. Furthermore, we show that this effect is causal: human visual cropping significantly mitigates the models' sensitivity to size. To scale up the usefulness of human cropping, we propose ViCrop, a general framework that applies automatic visual cropping to enhance the zero-shot VQA of MLLMs. We construct five variants of ViCrop, leveraging either external localization models or the decision process of the given MLLM itself. Our results show that ViCrop improves MLLMs' zero-shot accuracy across different VQA datasets, for example improving BLIP2-T5's performance by $32.23\%$ on the TextVQA test set. To facilitate further investigation of MLLMs' behaviors, our code is publicly released.
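For intuition, here is a minimal sketch of the ViCrop idea, not the authors' released implementation: a localization step proposes a bounding box for the question-relevant subject, the image is cropped around that box with some padding, and the question is answered from the zoomed-in view. The `locate_roi` function is a hypothetical placeholder for whatever backs the localization (an external detector or a box derived from the MLLM's own attention), and the BLIP2-T5 checkpoint name, prompt format, and the choice to answer from the crop alone are all assumptions made for illustration.

```python
# Minimal sketch of the ViCrop idea: not the authors' released code.
# Hypothetical pieces: `locate_roi` (stands in for an external localization
# model or an attention-derived box from the MLLM itself) and the way the
# cropped view is used (here we simply answer from the zoomed-in crop).
from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

@torch.no_grad()
def vqa(image: Image.Image, question: str) -> str:
    """Zero-shot VQA with BLIP2-T5 using the standard BLIP-2 VQA prompt."""
    inputs = processor(images=image,
                       text=f"Question: {question} Answer:",
                       return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(out[0], skip_special_tokens=True).strip()

def locate_roi(image: Image.Image, question: str):
    """Placeholder: return a (left, top, right, bottom) box for the
    question-relevant subject, e.g. from an open-vocabulary detector
    or from the MLLM's attention over image patches."""
    raise NotImplementedError

def vicrop_vqa(image: Image.Image, question: str, pad: float = 0.1) -> str:
    left, top, right, bottom = locate_roi(image, question)
    # Pad the box so the crop keeps some context around a small subject.
    w, h = right - left, bottom - top
    box = (int(max(0, left - pad * w)),
           int(max(0, top - pad * h)),
           int(min(image.width, right + pad * w)),
           int(min(image.height, bottom + pad * h)))
    return vqa(image.crop(box), question)
```

How the cropped and original views are combined, and how the box is obtained, are exactly the design choices the five ViCrop variants explore; this sketch fixes the simplest option for each.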

Submission history

From: Jiarui Zhang
[v1] Tue, 24 Oct 2023 17:48:04 UTC (19,612 KB)
[v2] Mon, 1 Jan 2024 23:50:31 UTC (9,692 KB)
