Abstract:
Zero-shot object detection (ZSD), which aims to identify and localize object classes that do not appear in the training data, has become a new challenge in computer vision. Despite the rapid development of ZSD methods, most existing approaches rely on strict mapping-transfer strategies to recognize unseen-class objects. These models ignore the semantic information of unseen classes, which leads to misclassification: during testing, the detection results are biased toward seen classes. Moreover, because different categories share similar attributes, the distribution of the mapped features is relatively chaotic. To address these problems, this paper proposes a zero-shot object detection framework based on a top-down attention mechanism. Specifically, a prior knowledge extraction module is constructed to generate, for each proposal, prior knowledge relevant to the final detection task. A top-down attention module then fuses the mapped features with this prior knowledge, which provides task orientation for the detection process and guides the model to attend to potential unseen-class features during training, preventing them from being simply classified as background and thereby mitigating the domain-shift problem. In addition, a new contrastive constraint is designed to improve the discriminability and clustering of the mapped features. Finally, we conduct extensive experiments on the standard MSCOCO dataset, which show that the proposed method achieves significant improvements on both ZSD and generalized ZSD (GZSD) tasks.
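As a rough illustration of the two mechanisms summarized above, the following is a minimal sketch, not the authors' implementation: all shapes, the residual fusion form, and function names are assumptions. It shows (a) a top-down attention step in which per-proposal prior knowledge produces channel weights that modulate the mapped features, and (b) a supervised contrastive constraint that pulls same-class mapped features together and pushes different-class ones apart.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def top_down_fuse(mapped_feats, prior, W):
    """Hypothetical top-down attention fusion.

    mapped_feats: (N, D) visual features mapped into the semantic space.
    prior:        (N, D) task-related prior knowledge per proposal.
    W:            (D, D) learned projection producing channel attention.
    """
    attn = softmax(prior @ W, axis=-1)         # (N, D) channel weights from the prior
    return mapped_feats * attn + mapped_feats  # residual fusion (an assumption)

def contrastive_loss(feats, labels, tau=0.1):
    """Simple supervised contrastive constraint on mapped features.

    Encourages features with the same label to be more similar than
    features with different labels (InfoNCE-style formulation).
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / tau
    n = len(labels)
    loss = 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        denom = np.exp(sim[i][np.arange(n) != i]).sum()
        loss += -np.mean([np.log(np.exp(sim[i, j]) / denom) for j in pos])
    return loss / n
```

Under this sketch, a labeling whose same-class features are genuinely close yields a lower contrastive loss than a mismatched labeling of the same features, which is the clustering behavior the constraint is meant to induce.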