Abstract:
To address the challenges of episodic few-shot visual question answering (EFSVQA), we propose a Frequency-based Cross-modal Chain-of-Thought Inference Network (FC-CoT-In). The method uses frequency-domain information to adaptively filter image content: an Adaptive High-Low Frequency Separation (AHLS) module decouples visual semantic representations at different levels, and a multimodal high-low frequency chain-of-thought mechanism reasons over them. A cognitive bank stores the intermediate results of cross-modal reasoning and, combined with a dual-path chain of thought, allows the model to make full use of the filtered visual information when answering. Experiments show that the proposed method significantly improves performance on EFSVQA tasks, improves the accuracy and generalization of cross-modal reasoning, and outperforms comparable algorithms on public benchmarks.
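
As a rough illustration of what high-low frequency separation of an image means, the sketch below splits a grayscale image into a low-frequency and a high-frequency component with a 2-D FFT. It is not the AHLS module: the abstract does not specify how AHLS chooses its frequency boundary, so the function name `split_frequencies` and the fixed `cutoff` parameter are assumptions made only for this example, whereas the proposed module is described as adaptive.

```python
import numpy as np


def split_frequencies(image, cutoff=0.25):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `cutoff` is a hypothetical fixed threshold, given as a fraction of the
    Nyquist frequency (0.5 cycles per pixel). An adaptive module would
    choose this boundary from the data instead of fixing it.
    """
    height, width = image.shape
    spectrum = np.fft.fft2(image)

    # Frequency of each FFT bin in cycles per pixel, in [-0.5, 0.5).
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Bins inside the radius are "low" frequencies, the rest are "high".
    low_mask = radius <= cutoff * 0.5

    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((16, 16))
    low, high = split_frequencies(img)
    # The two components partition the spectrum, so they sum to the input.
    print(np.allclose(low + high, img))
```

Because the two masks partition the spectrum, the components add back to the original image. The low-frequency part carries coarse structure and overall brightness, and the high-frequency part carries edges and fine texture.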