Figure: Video object segmentation using a dictionary of deep visual words. Our proposed method represents an object as a set of cluster centroids in a learned embedding space, or "visual words", which correspond to object parts in image space (bottom row). This representation allows more robust and efficient matching as shown by our results (top row). The visual words are learned in an unsupervised manner, using meta-learning to ensure the training and inference procedures are identical. The t-SNE plot on the right shows how different object parts cluster in different regions of the embedding space, and thus how our representation captures the multi-modal distribution of pixels constituting an object.
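To make the part-based representation concrete, below is a minimal sketch of building one object's dictionary of visual words by k-means clustering of its pixel embeddings. The helper name build_visual_words, the number of words, and the plain Lloyd iteration are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def build_visual_words(embeddings: torch.Tensor, mask: torch.Tensor,
                       num_words: int = 8, iters: int = 10) -> torch.Tensor:
    """Cluster one object's pixel embeddings into `num_words` centroids ("visual words").

    embeddings: (H, W, D) per-pixel embeddings from the segmentation network.
    mask:       (H, W) boolean mask of the object in the reference frame.
    Returns a (num_words, D) dictionary of visual words (cluster centroids).
    """
    pixels = embeddings[mask]                      # (N, D) embeddings of the object's pixels
    # Initialise centroids from randomly chosen object pixels (assumption).
    idx = torch.randperm(pixels.shape[0])[:num_words]
    words = pixels[idx].clone()                    # (K, D)
    for _ in range(iters):                         # plain Lloyd's k-means iterations
        dists = torch.cdist(pixels, words)         # (N, K) pairwise distances
        assign = dists.argmin(dim=1)               # nearest word for each pixel
        for k in range(words.shape[0]):
            members = pixels[assign == k]
            if len(members) > 0:                   # keep the old centroid if a cluster empties
                words[k] = members.mean(dim=0)
    return words
```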
Accurate video object segmentation methods finetune a model on the first annotated frame, and/or use additional inputs such as optical flow and complex post-processing. In contrast, we develop a fast algorithm that requires no finetuning, auxiliary inputs or post-processing, and segments a variable number of objects in a single forward pass. We represent each object as a dictionary of deep visual words, obtained by clustering its pixel embeddings in a learned embedding space. This representation allows us to robustly match to the reference objects throughout the video, because although the global appearance of an object changes as it undergoes occlusions and deformations, the appearance of its more local parts may stay consistent. We learn these visual words in an unsupervised manner, using meta-learning to ensure that our training objective matches our inference procedure. We achieve accuracy comparable to finetuning-based methods, and state-of-the-art speed/accuracy trade-offs on four video segmentation datasets.
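As a hedged illustration of how a variable number of objects can be segmented in a single forward pass, the sketch below scores every query pixel against each object's visual words (with the background treated as just another dictionary) and labels the pixel with the best-matching object. Cosine similarity and the hard argmax are assumptions made for this sketch and may differ from the paper's exact matching function.

```python
import torch

def segment_query(query_emb: torch.Tensor, dictionaries: list) -> torch.Tensor:
    """Assign each query-frame pixel to an object via its visual-word dictionary.

    query_emb:    (H, W, D) embeddings of the query frame.
    dictionaries: list of (K_i, D) visual-word tensors, one per object
                  (the background dictionary can simply be entry 0 of this list).
    Returns an (H, W) label map with values in [0, len(dictionaries)).
    """
    H, W, D = query_emb.shape
    flat = torch.nn.functional.normalize(query_emb.reshape(-1, D), dim=1)   # (HW, D)
    scores = []
    for words in dictionaries:
        w = torch.nn.functional.normalize(words, dim=1)                     # (K, D)
        sim = flat @ w.t()                                                   # (HW, K) cosine similarities
        scores.append(sim.max(dim=1).values)                                 # best-matching word per pixel
    return torch.stack(scores, dim=1).argmax(dim=1).reshape(H, W)
```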
Figure: Overview of the proposed method. The first frame of the video (the reference frame), which forms the support set S in our meta-learning setup, is passed through a deep segmentation network f(θ) to compute a d = 128 dimensional embedding for each pixel. A dictionary of deep visual words is then learned by clustering these embeddings for each object in the reference frame. Pixels of the query frame are classified as one of the objects based on their similarity to the visual words. The model is meta-trained by alternately learning the visual words given the model parameters θ, and learning the model parameters given the visual words.
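The alternating scheme in the caption might look roughly like the training step below, reusing the build_visual_words sketch above. The cross-entropy loss, the choice of optimiser interface, and stopping gradients through the clustering step are assumptions made to keep the sketch simple, not the paper's exact training objective.

```python
import torch

def meta_train_step(model, optimizer, support_img, support_masks,
                    query_img, query_labels, num_words: int = 8) -> float:
    """One meta-training step: (1) learn visual words given θ, (2) update θ given the words.

    model:         assumed to map an image to (H, W, D) per-pixel embeddings.
    support_masks: list of (H, W) boolean masks, one per object in the reference frame.
    query_labels:  (H, W) integer ground-truth object labels for the query frame.
    """
    # Step 1: build the dictionaries on the support (reference) frame, as at inference time.
    with torch.no_grad():                              # assumption: no gradients through clustering
        support_emb = model(support_img)               # (H, W, D)
        dictionaries = [build_visual_words(support_emb, m, num_words)   # k-means helper sketched earlier
                        for m in support_masks]

    # Step 2: classify query pixels against the fixed words and update the model parameters.
    query_emb = model(query_img)                       # (H, W, D), with gradients
    H, W, D = query_emb.shape
    flat = torch.nn.functional.normalize(query_emb.reshape(-1, D), dim=1)
    logits = torch.stack([
        (flat @ torch.nn.functional.normalize(w, dim=1).t()).max(dim=1).values
        for w in dictionaries], dim=1)                 # (HW, num_objects) similarity scores as logits
    loss = torch.nn.functional.cross_entropy(logits, query_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```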