An Episodic Few-Shot Visual Question Answering Method with Multi-Frequency Cross-Modal Chain-of-Thought Reasoning
Abstract
To address the challenges of episodic few-shot visual question answering (VQA), we propose a Frequency-Enhanced Cross-Modal Chain-of-Thought Inference Network. The method adaptively filters image content using frequency-domain information: an adaptive high-low frequency separation module decouples semantic representations at different levels of the visual modality, on top of which a multimodal high-low frequency chain-of-thought reasoning mechanism is constructed. A cognitive bank stores the intermediate results of cross-modal reasoning and, combined with a dual-path chain of thought, fully exploits the filtered visual information to produce more accurate answers. Experiments demonstrate that the proposed method significantly improves few-shot VQA performance, enhances the accuracy and generalization of cross-modal reasoning, and outperforms comparable algorithms on public benchmarks.
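As a minimal sketch of the kind of frequency separation the abstract refers to, the following code splits a 2-D feature map into low- and high-frequency components with a fixed radial mask in the Fourier domain. This is an illustrative assumption, not the paper's module: the proposed separation is adaptive, whereas here the cutoff ratio (`cutoff_ratio`) and the function name `split_frequency` are hypothetical choices for demonstration.

```python
import numpy as np

def split_frequency(image, cutoff_ratio=0.25):
    """Split a 2-D array into low- and high-frequency parts using a
    circular mask in the shifted Fourier spectrum (illustrative only)."""
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h / 2, w / 2
    radius = cutoff_ratio * min(h, w)
    # Boolean mask: True inside the low-frequency disc around the center
    low_mask = ((yy - cy) ** 2 + (xx - cx) ** 2) <= radius ** 2
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)))
    high = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)))
    return low, high

img = np.random.rand(64, 64)
low, high = split_frequency(img)
# The two components partition the spectrum, so they sum back to the input
assert np.allclose(low + high, img)
```

Because the low and high masks are complementary, the two components reconstruct the input exactly; an adaptive variant would learn or predict the mask per image rather than fix a radius.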