An Episodic Few-Shot Visual Question Answering Method with Multi-Frequency Cross-Modal Chain-of-Thought Reasoning

WEI Yifan; ZHANG Jing

doi:10.14135/j.cnki.1006-3080.20250910003

Citation: WEI Yifan, ZHANG Jing. An Episodic Few-Shot Visual Question Answering Method with Multi-Frequency Cross-Modal Chain-of-Thought Reasoning[J]. Journal of East China University of Science and Technology. DOI: 10.14135/j.cnki.1006-3080.20250910003

Abstract

To address the challenges of episodic few-shot visual question answering (VQA), we propose a Frequency-enhanced Cross-modal Chain of Thought Inference Network. The method uses frequency-domain information to filter image content adaptively: it designs an adaptive high-low frequency separation module that decouples semantic representations at different levels of the visual modality, and it constructs a multimodal high-low frequency chain-of-thought reasoning mechanism. A cognitive bank stores the intermediate results of cross-modal reasoning and, combined with a dual-path chain of thought, allows the model to fully exploit the filtered visual information for more accurate episodic few-shot VQA. Experiments show that the proposed method significantly improves the model's ability to handle few-shot VQA tasks, improves the accuracy and generalization of cross-modal reasoning, and outperforms comparable algorithms on public benchmarks.
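The abstract does not specify how the high-low frequency separation is implemented. As a rough illustration of the general idea only, the sketch below splits a grayscale image into low- and high-frequency components with a fixed radial mask in the Fourier domain. The function name `split_frequencies` and the `cutoff` parameter are hypothetical, and the fixed mask is a stand-in: the module described above is adaptive, which this sketch is not.

```python
import numpy as np


def split_frequencies(image, cutoff=0.25):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    Assumption: `cutoff` is a fixed radius in cycles per pixel. The paper's
    module is adaptive; this fixed mask only illustrates the decomposition.
    """
    h, w = image.shape
    # Frequency coordinates of each FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Bins inside the radius are treated as "low", the rest as "high".
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


# Usage: the two components sum back to the original image.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
low, high = split_frequencies(img)
print(np.allclose(low + high, img))
```

Because the two masks are complementary, the low and high components always sum back to the input. Intuitively, the low-frequency part carries coarse layout and the high-frequency part carries edges and texture, which is one plausible reading of "semantic representations at different levels".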
