    An Episodic Few-Shot Visual Question Answering Method with Multi-Frequency Cross-Modal Chain-of-Thought Reasoning

    • Abstract: Existing episodic few-shot visual question answering (VQA) methods rely mostly on spatial-domain visual features and overlook the potential of frequency-domain information for decoupling global and local semantics, which leaves the model with an insufficiently layered understanding of image content. In addition, conventional reasoning mechanisms struggle to support progressive interaction across frequencies, which limits further gains in cross-modal reasoning. To address these challenges, we propose a Frequency-enhanced Cross-modal Chain of Thought Inference Network for episodic few-shot VQA. The method filters image content using frequency-domain information, introduces an adaptive high-low frequency separation module to decouple semantic representations at different levels in the visual modality, and builds a multimodal high-low frequency chain-of-thought reasoning mechanism. A cognitive bank stores the intermediate results of cross-modal reasoning, and a dual-path chain of thought exploits the filtered visual information to answer episodic few-shot VQA questions more accurately. Experiments show that the proposed method significantly improves few-shot VQA performance, improves the accuracy and generalization of cross-modal reasoning, and outperforms comparable algorithms on public datasets.
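    The abstract does not specify how the high-low frequency separation module is implemented. As a rough illustration of the general idea of splitting an image into low- and high-frequency components, the sketch below applies a fixed radial mask in the Fourier domain. This is an assumption for illustration only: the function name `split_frequencies` and the `cutoff` parameter are hypothetical, and the fixed cutoff stands in for whatever adaptive, learned separation the paper actually uses.

```python
import numpy as np


def split_frequencies(image, cutoff=0.25):
    """Split a 2-D array into low- and high-frequency parts.

    Hypothetical helper: a fixed radial cutoff (in cycles per sample),
    not the adaptive module described in the abstract.
    """
    h, w = image.shape
    # Centered 2-D spectrum of the input.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Normalized frequency coordinates, shifted to match the spectrum layout.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Frequencies inside the radius are treated as "low", the rest as "high".
    low_mask = radius <= cutoff
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high
```

    Because the two masks are complementary, the low and high components sum back to the original input, so nothing is lost in the split. A learned variant would replace the hard mask with parameters trained end to end.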

       
