Ad-hoc Video Search with Discrete and Continuous Representations

Abstract

We present two approaches we developed for cross-modal retrieval in the Ad-hoc Video Search (AVS) task. Our system is fully automatic and uses no in-domain data or annotations. We jointly exploit discrete representations in a semantic space learned from multiple mutually exclusive source domains and continuous representations in a textual-visual joint-embedding space. Textual queries and videos are encoded in these spaces, and retrieval is performed by matching them. Our system achieved 12.59 inferred average precision (IAP) on the AVS 2016 validation set, and achieved 8.7 IAP and ranked 2nd in the 2018 AVS task.
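The retrieval step in the joint-embedding space can be illustrated with a minimal sketch: given an embedded query and a bank of embedded videos, rank videos by cosine similarity. The dimensionality, the random embeddings, and the function names here are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def rank_videos(query_vec, video_vecs):
    """Rank videos by descending cosine similarity to the query embedding."""
    # Normalize so that dot products equal cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    v = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(-scores)

# Toy example: 100 videos with 64-dim embeddings (random stand-ins for
# real joint-embedding vectors), and a query very close to video 7.
rng = np.random.default_rng(0)
videos = rng.normal(size=(100, 64))
query = videos[7] + 0.01 * rng.normal(size=64)
ranking = rank_videos(query, videos)
print(ranking[0])
```

In the real system the query embedding comes from the text encoder and the video embeddings from the visual encoder, trained so that matching pairs lie close in the shared space.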