End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman | CVPR 2020
In a Nutshell 🥜
Miech et al. [1] introduce a new approach to learning alignments between text and video. Specifically, the work uses HowTo100M [2], a large-scale dataset of instructional videos, and learns to automatically align narrations (transcribed to text with temporally localized timestamps via automatic speech recognition) to videos. A major challenge of the HowTo100M dataset is misalignment: the authors (the dataset shares its primary author with this paper) estimate that approximately 50% of clip-narration pairs are not correctly aligned. This paper therefore proposes the MIL-NCE objective, a combination of Multiple Instance Learning (MIL) and Noise Contrastive Estimation (NCE) designed specifically to tackle this misalignment problem.
The MIL-NCE learning approach works as follows and has benefits over standard MIL and standard NCE alone. Extending MIL, given a video clip, the k narrations that occur closest in time to the clip are treated as positive candidates. This allows some tolerance for situations such as the narrator describing the content of a clip slightly before or after its visual appearance, and contrasts with standard NCE, which considers only a single positive pair. Using a softmax version of NCE, MIL-NCE keeps multiple correct positives and downweights incorrect ones based on a discriminative ratio against negatives, in contrast with standard MIL, which selects a single positive and discards the rest.
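In other words, the numerator of the NCE softmax ratio sums over the whole bag of candidate narrations rather than a single pair. Below is a minimal PyTorch sketch of this idea, assuming in-batch negatives and dot-product similarities; the function name, tensor shapes, and batching scheme are illustrative rather than the paper's exact implementation.

```python
import torch

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Sketch of a MIL-NCE style loss (shapes and names are illustrative).

    video_emb: (B, D) clip embeddings f(x).
    text_emb:  (B, K, D) embeddings g(y) of the K narrations closest in time
               to each clip (the bag of positive candidates).
    Negatives: narration candidates belonging to the other clips in the batch.
    """
    B, K, D = text_emb.shape
    # Similarity of every clip to every narration candidate in the batch: (B, B, K)
    sim = (video_emb @ text_emb.reshape(B * K, D).t()).reshape(B, B, K)
    # Numerator: log of the summed exp-similarities over each clip's own positive bag.
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=1)   # (B,)
    # Denominator: log of the sum over the positive bag plus all in-batch negatives.
    all_pairs = torch.logsumexp(sim.reshape(B, B * K), dim=1)             # (B,)
    # MIL-NCE maximizes the ratio of positives to positives-plus-negatives.
    return -(pos - all_pairs).mean()
```

Minimizing this loss jointly trains the video and text encoders to score a clip higher against its bag of nearby narrations than against narrations taken from other clips.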
The paper then performs ablation studies and evaluations of the learned representations. The ablations compare NCE against related objectives such as a max-margin ranking loss and a binary cross-entropy loss, and compare MIL-NCE against other MIL-style aggregation strategies such as Max+NCE (max-pooling over candidates; see the sketch after this paragraph), Attn+NCE (using cross-modal attention weights), and Cat+NCE (concatenating all candidates into one longer narration). Further ablations vary the number of positive and negative candidates and the choice of language model, such as an LSTM, GRU, or Transformer. The paper then compares the learned representations against the state of the art on five downstream tasks (action recognition, text-to-video retrieval, action localization, action step localization, and action segmentation) across eight datasets, and demonstrates gains over most self-supervised approaches and several fully supervised approaches, even without finetuning.
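For reference, one reading of the Max+NCE baseline is that the sum over the positive bag is replaced with a hard max, so only the single best-matching narration per clip contributes. The hedged sketch below reuses the assumptions of the earlier snippet and is my own illustration, not the paper's implementation.

```python
import torch

def max_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Max+NCE-style baseline: keep only the best-scoring positive candidate."""
    B, K, D = text_emb.shape
    sim = (video_emb @ text_emb.reshape(B * K, D).t()).reshape(B, B, K)
    # Hard max over the positive bag: the other candidate narrations are discarded.
    pos = sim[torch.arange(B), torch.arange(B)].max(dim=1).values         # (B,)
    # Negatives: candidates belonging to the other clips in the batch.
    neg_mask = ~torch.eye(B, dtype=torch.bool).unsqueeze(-1).expand(B, B, K)
    neg = sim[neg_mask].reshape(B, (B - 1) * K)
    denominator = torch.logsumexp(torch.cat([pos.unsqueeze(1), neg], dim=1), dim=1)
    return -(pos - denominator).mean()
```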
Some Thoughts 💭
This paper contributes an interesting hybrid MIL-and-NCE objective that is well suited to learning strong text-video representations from large-scale, unlabeled, and noisy data.
I also enjoyed the paper's experiments section, in particular its demonstration that the learned representations generalize across benchmark tasks in a variety of domains.
[1] Miech, A., Alayrac, J. B., Smaira, L., Laptev, I., Sivic, J., & Zisserman, A. (2020). End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9879-9889).
[2] Miech, A., Zhukov, D., Alayrac, J. B., Tapaswi, M., Laptev, I., & Sivic, J. (2019). HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2630-2640).