LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

Adrian Bulat, Georgios Tzimiropoulos

CVPR 2023

Abstract

Soft prompt learning has recently emerged as one of the methods of choice for adapting V&L models to a downstream task using a few training examples. However, current methods significantly overfit the training data, suffering from large accuracy degradation when tested on unseen classes from the same domain. To this end, in this paper, we make the following 4 contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method by means of a text-to-text cross-entropy loss that maximizes the probability of the learned prompts to be correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we propose grouped LASP where each group of prompts is optimized with respect to a separate subset of textual prompts. (3) We identify a visual-language misalignment introduced by prompt learning and LASP, and more importantly, propose a re-calibration mechanism to address it. (4) We show that LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
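
As a rough illustration of contribution (1), the text-to-text loss can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `text_to_text_loss`, the feature shapes, the cosine-similarity logits, and the temperature value of 0.07 are illustrative choices that the abstract does not specify. The sketch treats the hand-crafted prompt features as class weights and applies a cross-entropy so that each learned prompt is classified as its own class.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_to_text_loss(learned, handcrafted, temperature=0.07):
    """Cross-entropy between learned and hand-crafted prompt features.

    learned:     (C, d) text features of the learned soft prompts, one per class.
    handcrafted: (L, C, d) text features of L hand-crafted templates per class.
    The shapes and the temperature are assumptions made for this sketch.
    """
    learned = l2_normalize(learned)
    handcrafted = l2_normalize(handcrafted)

    # logits[l, i, j]: similarity of the learned prompt for class i to the
    # hand-crafted prompt for class j under template l.
    logits = np.einsum("id,ljd->lij", learned, handcrafted) / temperature

    # Numerically stable log-softmax over the candidate classes j.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    # The target for the learned prompt of class i is class i itself.
    idx = np.arange(learned.shape[0])
    correct = log_probs[:, idx, idx]  # shape (L, C)
    return float(-correct.mean())
```

Because this loss uses text features only, rows for class names without any images (the virtual classes of contribution (4)) could be passed in the same way as rows for classes that do have training images.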

Publication

@inproceedings{bulat2023lasp,
  title={LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision \& Language Models},
  author={Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23232--23241},
  year={2023}
}

Publications on Vision & Language

  • VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
  • Compress & Cache: Vision token compression for efficient generation and retrieval

    Advances in Neural Information Processing Systems (NeurIPS), 2025
  • Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
  • VladVA: Discriminative Fine-tuning of LVLMs

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
  • Efficient Vision-Language pre-training via domain-specific learning for human activities

    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
  • CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

    European Conference on Computer Vision (ECCV), 2024
  • FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  • ReGen: A good Generative zero-shot video classifier should be Rewarded

    International Conference on Computer Vision (ICCV), 2023
  • Black Box Few-Shot Adaptation for Vision-Language models

    International Conference on Computer Vision (ICCV), 2023
  • FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training

    International Conference on Computer Vision (ICCV), 2023
  • Bayesian Prompt Learning for Image-Language Model Generalization

    International Conference on Computer Vision (ICCV), 2023
  • LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023