Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective

Salin E, Farah B, Ayache S, Favre B (2022)


Publication Language: English

Publication Type: Journal article

Publication year: 2022

Journal: Proceedings of the AAAI Conference on Artificial Intelligence

Volume: 36

Issue: 10

Pages: 11248-11257

DOI: 10.1609/aaai.v36i10.21375

Abstract

In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that those models are able to pinpoint fine-grained multimodal differences. Finally, we also notice that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate experiments.

How to cite

APA:

Salin, E., Farah, B., Ayache, S., & Favre, B. (2022). Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11248-11257. https://doi.org/10.1609/aaai.v36i10.21375

MLA:

Salin, Emmanuelle, et al. "Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 36.10 (2022): 11248-11257.
