Learning Deep Structure-Preserving Image-Text Embeddings
Liwei Wang, Yin Li, Svetlana Lazebnik | CVPR 2016
In a Nutshell 🥜
Wang et al.1 introduce a method for learning image-text correspondence. The method uses two branches of neural networks (one for images and one for text). Each branch consists of fully connected layers with ReLU activations, followed by L2 normalization. The method then computes the Euclidean distance between the embeddings produced by the two branches. The network is trained with a stochastic margin-based loss combining bidirectional cross-view ranking constraints and within-view structure-preserving constraints. At a high level, distances between positive pairs are pushed to be smaller than distances between negative pairs by an enforced margin. The loss is a hinge loss, and training examples are drawn via triplet sampling. A sketch of the core idea is shown below.
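To make the setup concrete, here is a minimal PyTorch-style sketch of one embedding branch and the bidirectional ranking term of the loss. This is my illustration, not the paper's code: the within-view structure-preserving terms are omitted, the layer sizes, margin value, and in-batch negative scheme are assumptions, and the paper's actual triplet sampling is simplified to "all non-matching pairs in the batch are negatives."

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingBranch(nn.Module):
    """One branch: fully connected layers with ReLU, then L2 normalization."""
    def __init__(self, in_dim, hidden_dim, embed_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.normalize(x, p=2, dim=1)  # L2-normalize the embedding

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.05):
    """Hinge loss over in-batch triplets in both directions.

    Assumes img_emb[i] and txt_emb[i] form a positive pair; every other
    pairing in the batch is treated as a negative (a simplification of the
    paper's triplet sampling).
    """
    # Pairwise Euclidean distances between all image and text embeddings
    dist = torch.cdist(img_emb, txt_emb, p=2)   # (B, B)
    pos = dist.diag().unsqueeze(1)              # distance of each positive pair

    # Image-to-text: the matching sentence should be closer than any other
    loss_i2t = F.relu(margin + pos - dist)
    # Text-to-image: the same constraint in the other direction
    loss_t2i = F.relu(margin + pos.t() - dist)

    # Zero out the diagonal (positive pairs) before averaging
    mask = 1.0 - torch.eye(dist.size(0), device=dist.device)
    return ((loss_i2t + loss_t2i) * mask).sum() / mask.sum()
```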
The paper then evaluates the method with experiments and ablation studies on image-to-sentence and sentence-to-image retrieval on the Flickr30K and MSCOCO datasets, and on phrase localization on the Flickr30K Entities dataset. The evaluation metrics are Recall@K (K = 1, 5, 10) for image-sentence retrieval, and Recall@K (a region proposal counts as correct if its IoU with the ground truth is greater than or equal to 0.5) plus mAP for phrase localization. Overall, the experiments demonstrate state-of-the-art results and justify the various design decisions about the architecture and objective function.
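For reference, below is a small NumPy sketch of how Recall@K for cross-modal retrieval can be computed from a query-to-candidate distance matrix. The function name and the set-of-ground-truth-indices convention are mine, not the paper's; it simply counts a query as a hit if any correct match appears among the K nearest candidates (e.g., any of an image's five captions).

```python
import numpy as np

def recall_at_k(dist, gt, ks=(1, 5, 10)):
    """Recall@K for cross-modal retrieval.

    dist: (num_queries, num_candidates) distance matrix between query and
          candidate embeddings (e.g., images vs. sentences).
    gt:   list of sets; gt[i] holds the candidate indices that are correct
          matches for query i (e.g., the five captions of an image).
    """
    ranking = np.argsort(dist, axis=1)  # candidates sorted nearest-first
    recalls = {}
    for k in ks:
        hits = sum(1 for i, row in enumerate(ranking)
                   if gt[i] & set(row[:k].tolist()))
        recalls[k] = hits / len(gt)
    return recalls
```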
Some Thoughts 💭
The paper’s proposed architecture is similar to Siamese networks; in fact, the authors point out that Siamese networks can be considered a special case of their architecture. I think the paper’s unique contribution is bringing this architecture into the multimodal domain with modifications, and backing it up with extensive experiments on relevant tasks and datasets.
I liked how clear and reproducible the experimental section is and found the ablation studies in particular to be very insightful.
Wang, L., Li, Y., & Lazebnik, S. (2016). Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5005-5013).