TL;DR:
- DeepMind presents CapPa, an image captioning-based pretraining strategy that can rival CLIP.
- CapPa demonstrates favorable model and data scaling properties, positioning image captioning as a competitive pretraining approach for vision backbones.
- The Cap vision backbone surpasses CLIP models on tasks such as few-shot classification, captioning, OCR, and VQA.
- CapPa combines autoregressive prediction with parallel prediction, leveraging Vision Transformer (ViT) encoders and a Transformer decoder architecture.
- CapPa outperforms conventional Cap and achieves comparable performance to CLIP∗ in various downstream tasks.
- CapPa exhibits strong zero-shot capabilities and promising scaling properties.
Main AI News:
In the realm of high-quality vision backbones, Contrastive Language-Image Pretraining (CLIP) has long been hailed as a leading strategy, showcasing remarkable zero-shot transfer capabilities and rivaling the performance of top label-supervised approaches. Image captioning, by contrast, despite its simplicity, has remained overshadowed by CLIP, largely because of its perceived limitations in zero-shot transfer.
However, a recent groundbreaking paper titled “Image Captioners Are Scalable Vision Learners Too” by a research team at DeepMind challenges this notion. The team introduces CapPa, an image captioning-based pretraining strategy that stands toe-to-toe with CLIP, boasting favorable model and data scaling properties. This study illuminates the potential of plain image captioning as a competitive pretraining approach for vision backbones.
The primary objective of this endeavor is to develop a plain image captioning method that matches the simplicity, scalability, and efficiency of the renowned CLIP. To achieve this, the research team embarks on a comprehensive comparison between image captioning, referred to as Captioner (Cap), and the CLIP strategy. They meticulously align the pretraining compute, model capacity, and training data to ensure a fair assessment.
Notably, the researchers observe that the Cap vision backbone outperforms CLIP models on tasks such as few-shot classification, captioning, optical character recognition (OCR), and visual question answering (VQA). Furthermore, Cap performs comparably to CLIP when transferred to classification tasks with extensive labeled training data, suggesting that captioning-based pretraining is especially well suited to multimodal downstream tasks.
To push the boundaries further, the researchers introduce the CapPa pretraining procedure, a mixed training strategy that combines standard autoregressive prediction (Cap) with parallel prediction (Pa). The model adopts a Vision Transformer (ViT) as the vision encoder and a standard Transformer decoder for predicting image captions; the ViT-encoded patch sequence is fed to the decoder through cross-attention.
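To make the architecture concrete, here is a minimal PyTorch sketch of a Cap-style captioner: a decoder that cross-attends to a ViT-encoded patch sequence and predicts the next caption token. This is an illustrative reconstruction, not the authors' implementation; the class name, dimensions, and hyperparameters are placeholders, and the ViT encoder producing `image_tokens` is assumed to exist separately.

```python
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """Illustrative Cap-style captioner: ViT patch embeddings feed a
    Transformer decoder through cross-attention (not the paper's code)."""

    def __init__(self, vocab_size=32_000, d_model=768, nhead=12,
                 num_decoder_layers=6, max_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_decoder_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_tokens, caption_tokens):
        # image_tokens: (B, N, d_model) patch sequence from a ViT encoder (assumed).
        # caption_tokens: (B, T) token ids, already shifted right for teacher forcing.
        T = caption_tokens.size(1)
        x = self.token_emb(caption_tokens) + self.pos_emb[:, :T]
        # Causal mask implements the standard autoregressive (Cap) objective.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=caption_tokens.device),
            diagonal=1)
        # Each decoder layer cross-attends to the ViT-encoded sequence (`memory`).
        h = self.decoder(tgt=x, memory=image_tokens, tgt_mask=causal_mask)
        return self.lm_head(h)  # (B, T, vocab_size) next-token logits
```

The causal mask above realizes the plain Cap objective; CapPa's parallel-prediction variant changes only what the decoder receives as input, as sketched after the next paragraph.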
During training, the model is therefore not restricted to autoregressive prediction alone: for part of the training it is instead asked to predict all caption tokens in parallel. Because the decoder then predicts every token independently, without access to the preceding caption tokens, the image information supplied through cross-attention becomes the decisive signal for accurate prediction.
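A training-step sketch, under the same assumptions as above, shows one way the two objectives can be mixed: for some examples the shifted caption is replaced by [MASK] tokens, so the decoder must recover every token from the image alone. The mixing ratio, the per-batch coin flip, and the function names are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def cappa_training_step(model, mask_token_id, image_tokens, captions,
                        parallel_fraction=0.5):
    """Hypothetical mixed objective: autoregressive captioning (Cap) plus
    parallel prediction (Pa). `parallel_fraction`, `mask_token_id`, and the
    per-batch switch are assumptions for this sketch."""
    # Teacher-forced decoder input: caption shifted right (BOS assumed at index 0).
    decoder_input = captions[:, :-1]
    targets = captions[:, 1:]

    if torch.rand(()) < parallel_fraction:
        # Parallel prediction: replace every decoder input token with [MASK],
        # so no caption information leaks in and the decoder must reconstruct
        # all tokens from the image via cross-attention alone. The causal mask
        # inside the model then carries no usable caption information.
        decoder_input = torch.full_like(decoder_input, mask_token_id)

    logits = model(image_tokens, decoder_input)      # (B, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    return loss
```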
In an extensive empirical study, CapPa is compared against conventional Cap and the state-of-the-art CLIP approach across a wide range of downstream tasks, including image classification, captioning, OCR, and visual question answering. CapPa outperforms Cap on nearly all tasks and matches or surpasses CLIP∗ (the authors' CLIP baseline trained under the same conditions) at the same batch size. Notably, CapPa showcases robust zero-shot capabilities and promising scaling properties.
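As a concrete illustration of how a pure captioner can be used in a zero-shot-style classification setting, one option is to score candidate prompts such as "a photo of a {class}" by their log-likelihood under the decoder and pick the highest-scoring class. The sketch below assumes that approach and the model interface defined earlier; the prompt format, tokenization, and padding handling are simplified and not taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_by_caption_score(model, image_tokens, class_prompt_ids):
    """Score each candidate caption by its sequence log-likelihood under the
    captioner and return the index of the best-scoring class.
    class_prompt_ids: (C, T) pre-tokenized prompts (padding/EOS handling omitted)."""
    scores = []
    for prompt in class_prompt_ids:                  # one candidate class at a time
        inp = prompt[:-1].unsqueeze(0)               # teacher-forced decoder input
        tgt = prompt[1:].unsqueeze(0)                # next-token targets
        logits = model(image_tokens, inp)            # (1, T-1, vocab)
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
        scores.append(token_logp.sum())              # sum of token log-probs
    return int(torch.stack(scores).argmax())
```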
Source: Synced
Conclusion:
The emergence of CapPa as a competitive image captioning-based pretraining strategy marks a significant development in the market. DeepMind’s research showcases the potential of image captioning to challenge the dominance of CLIP, offering businesses and researchers new avenues for simpler, scalable, and efficient pretraining of vision backbones. With CapPa’s strong performance across downstream tasks and its zero-shot capabilities, the market can expect increased innovation and investment in captioning as a pretraining task for vision encoders.