Episodic Few-Shot Visual Question Answering Method with Multi-Frequency Chain-of-Thought Reasoning

Wei Yifan (魏一凡); Zhang Jing (张静)

doi:10.14135/j.cnki.1006-3080.20250910003

Abstract

Existing episodic few-shot visual question answering methods rely mostly on spatial-domain visual features and overlook the potential of frequency-domain information for decoupling global and local semantics, which leaves the model with too shallow an understanding of image content. In addition, conventional reasoning mechanisms struggle to support progressive interaction across frequencies, which limits further gains in cross-modal reasoning. To address these problems in Episodic Few-shot Visual Question Answering (EFSVQA), we propose a Frequency-based Cross-modal Chain-of-Thought Inference Network (FC-CoT-In). The method uses frequency-domain information to filter image content, introduces an Adaptive High-Low Frequency Separation Module (AHLS) to decouple semantic representations at different levels of the visual modality, and builds a multimodal high-low frequency chain-of-thought reasoning mechanism. A cognitive bank stores the intermediate results of cross-modal reasoning, and a dual-path chain of thought exploits the filtered visual information to answer questions more accurately. Experiments on public datasets show that the method improves the accuracy and generalization of cross-modal reasoning on EFSVQA tasks and outperforms comparable algorithms.
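The abstract does not specify how AHLS separates frequencies, so the following is only a minimal sketch of the general idea of splitting an image into low- and high-frequency components with a 2-D Fourier transform. The function name `split_frequencies` and the fixed cutoff `radius` are assumptions made for illustration; the module described in the abstract is adaptive, whereas this sketch uses a hard circular mask.

```python
import numpy as np


def split_frequencies(image, radius):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `radius` is a hypothetical fixed cutoff measured in frequency bins.
    An adaptive module would learn or predict the mask instead.
    """
    h, w = image.shape
    # Centered spectrum: after fftshift the zero frequency is at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    # Complementary masks, so the two parts sum back to the input image.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high
```

In this sketch the low-frequency part carries the coarse, global layout of the image and the high-frequency part carries edges and fine detail, which is the kind of global versus local decoupling the abstract refers to. Because the two masks are complementary, adding the two parts recovers the original image up to floating-point error.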
