A Memory-Based Robust Region Feature Synthesizer for Zero-Shot Object Detection


Zero-shot object detection (ZSD) aims to detect both the object categories seen during training and those never observed before testing, which makes it a challenging yet anticipated task in the community. Current approaches tackle this problem by borrowing the feature synthesis techniques used in zero-shot image classification (ZSC) without examining the problems inherent to ZSD. In this paper, we analyze the outstanding challenges that ZSD presents compared with ZSC, namely severe intra-class variation, complex category co-occurrence, and an open test scenario, and reveal how they interfere with the region feature synthesis process. In view of this, we propose a novel memory-based robust region feature synthesizer (M-RRFS) for ZSD. It is equipped with the Intra-class Semantic Diverging (IntraSD), Inter-class Structure Preserving (InterSP), and Cross-Domain Contrast Enhancing (CrossCE) mechanisms, which address inadequate intra-class diversity, insufficient inter-class separability, and weak inter-domain contrast, respectively. Moreover, when designing the overall learning framework, we develop an asynchronous memory container (AMC) that explores the cross-domain relationship between the seen-class domain and the unseen-class domain in order to reduce the overlap between their distributions. Based on the AMC, we also propose a memory-assisted ZSD inference process to further boost prediction accuracy. To evaluate the proposed approach, we conduct comprehensive experiments on the MS-COCO, PASCAL VOC, ILSVRC, and DIOR datasets and achieve superior performance. Notably, we achieve new state-of-the-art performance on the MS-COCO dataset under the 48/17 category split setting, i.e., 64.0%, 60.9%, and 55.5% Recall@100 at IoU = 0.4, 0.5, and 0.6, respectively, and 15.1% mAP at IoU = 0.5. Meanwhile, the experiments on the DIOR dataset establish the first benchmark for evaluating zero-shot object detection performance on remote sensing images.
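For readers unfamiliar with the reported metric, the following is a minimal sketch of how Recall@K at a given IoU threshold can be computed for a single image. It is an illustration only, not the evaluation code used in this work: the function names `iou` and `recall_at_k`, the `(x1, y1, x2, y2)` box format, and the per-image, class-matched matching rule are all assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # assumed format: (x1, y1, x2, y2)
GroundTruth = Tuple[str, Box]             # (class label, box)
Detection = Tuple[float, str, Box]        # (confidence score, class label, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ground_truths: List[GroundTruth],
    detections: List[Detection],
    k: int = 100,
    iou_thr: float = 0.5,
) -> float:
    """Fraction of ground-truth boxes matched by one of the top-k detections.

    A ground-truth box counts as recalled if any of the k highest-scoring
    detections has the same label and IoU >= iou_thr (hypothetical rule
    chosen for this sketch).
    """
    if not ground_truths:
        return 0.0
    top_k = sorted(detections, key=lambda d: d[0], reverse=True)[:k]
    hits = 0
    for gt_label, gt_box in ground_truths:
        if any(
            label == gt_label and iou(gt_box, box) >= iou_thr
            for _, label, box in top_k
        ):
            hits += 1
    return hits / len(ground_truths)


if __name__ == "__main__":
    gts = [("cat", (0, 0, 10, 10)), ("dog", (20, 20, 30, 30))]
    dets = [(0.9, "cat", (0, 0, 10, 8)), (0.8, "dog", (25, 25, 35, 35))]
    for thr in (0.4, 0.5, 0.6):
        print(thr, recall_at_k(gts, dets, k=100, iou_thr=thr))
```

Evaluating the same detections at several thresholds, as in the loop above, mirrors the way results are reported at IoU = 0.4, 0.5, and 0.6; a dataset-level figure would aggregate over all images, which this sketch leaves out.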

In International Journal of Computer Vision