
    WEI Yifan, ZHANG Jing. An Episodic Few-Shot Visual Question Answering Method with Multi-Frequency Cross-Modal Chain-of-Thought Reasoning[J]. Journal of East China University of Science and Technology. DOI: 10.14135/j.cnki.1006-3080.20250910003

    An Episodic Few-Shot Visual Question Answering Method with Multi-Frequency Cross-Modal Chain-of-Thought Reasoning

    • To address the challenges of episodic few-shot visual question answering (VQA), we propose a Frequency-enhanced Cross-modal Chain-of-Thought Inference Network. The method filters image content adaptively using frequency-domain information, introduces an adaptive high/low-frequency separation module to decouple semantic representations at different levels of the visual modality, and constructs a multimodal high/low-frequency chain-of-thought reasoning mechanism. A cognitive bank stores the intermediate results of cross-modal reasoning and is combined with a dual-path chain of thought, so that the filtered visual information is fully exploited for more accurate episodic few-shot VQA. Experiments show that the proposed method significantly improves the model's handling of few-shot VQA tasks, improves the accuracy and generalization of cross-modal reasoning, and outperforms comparable algorithms on public benchmarks.
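To make the idea of separating high- and low-frequency visual content concrete, the following is a minimal sketch and not the authors' module. It assumes a grayscale image stored as a list of rows, uses a fixed box blur as the low-pass filter (the paper's separation is described as adaptive, which this is not), and takes the residual as the high-frequency part. The names `box_blur` and `split_frequencies` are hypothetical.

```python
def box_blur(image, radius=1):
    """Low-pass filter: mean over a (2*radius+1)^2 window.

    At the borders the window is truncated, so only in-bounds
    pixels are averaged.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def split_frequencies(image, radius=1):
    """Split an image into a smooth (low) and a detail (high) component.

    Assumption for illustration only: a fixed blur radius stands in for
    the adaptive cutoff described in the abstract.
    """
    low = box_blur(image, radius)
    high = [
        [image[y][x] - low[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]
    return low, high


if __name__ == "__main__":
    # A flat region with one bright pixel: the spike shows up mostly in `high`.
    img = [[0.0] * 5 for _ in range(5)]
    img[2][2] = 9.0
    low, high = split_frequencies(img)
    print(low[2][2], high[2][2])
```

By construction the two components sum back to the original image, so a downstream model could process coarse layout (low) and edges or texture (high) along separate paths without losing information.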
