Publications

2026
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Adrian Bulat*, Alberto Baldrati*, Ioannis Maniadis Metaxas*, Yassine Ouali, Georgios Tzimiropoulos

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
PDF

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text and image tokens, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
```
@inproceedings{bulat2026visor,
  title={VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions},
  author={Bulat, Adrian and Baldrati, Alberto and Metaxas, Ioannis Maniadis and Ouali, Yassine and Tzimiropoulos, Georgios},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```
2026
Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration

I-Hsiang Chen, Isma Hadji, Enrique Sanchez, Adrian Bulat, Sy-Yen Kuo, Radu Timofte, Georgios Tzimiropoulos, Brais Martinez

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
PDF Code

Image restoration aims to recover high-quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high-quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess and restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach yields consistent improvements under single, unknown and composite degradations, thereby establishing a new state-of-the-art.
```
@inproceedings{chen2026restore,
  title={Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration},
  author={Chen, I-Hsiang and Hadji, Isma and Sanchez, Enrique and Bulat, Adrian and Kuo, Sy-Yen and Timofte, Radu and Tzimiropoulos, Georgios and Martinez, Brais},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```
2025
Compress & Cache: Vision token compression for efficient generation and retrieval

Adrian Bulat*, Yassine Ouali*, Georgios Tzimiropoulos

Advances in Neural Information Processing Systems (NeurIPS), 2025
PDF

This work aims to compress the vision tokens of an LVLM into a representation that is simultaneously suitable for (a) generative and (b) discriminative tasks, (c) is nearly lossless, and (d) storage-efficient. To this end, we propose C&C, a novel compression method that leverages the LVLM itself for task-agnostic visual token compression. Unlike prior methods that perform token reduction on-the-fly, our approach offloads computation to a dedicated, upfront indexing stage, effectively decoupling compression from generation. This enables learning more powerful representations for generation during inference. At the core of C&C is a "double-forward pass" training strategy. During the first forward pass, the LLM (of the LVLM) creates a bottleneck by compressing the dense visual tokens into a few summary tokens. Subsequently, the second forward pass processes the language instruction(s) alongside the summary tokens, used as a direct replacement for the image ones. The training of C&C is guided by two key losses: an autoregressive loss applied after the second pass that provides a direct optimization objective for reconstructing the original information flow, and a contrastive loss applied after the first pass to bolster the representational strength of the summary tokens, particularly for discriminative tasks. Moreover, we propose stage-specific adapters for further enhancing performance. C&C produces highly informative compressed representations. An in-depth ablation study confirms the efficacy of our approach. For generative tasks, we achieve a 2x higher compression rate without compromising capabilities, setting a new state-of-the-art. For discriminative tasks, we establish new state-of-the-art results on image retrieval and compositionality benchmarks.
```
@inproceedings{bulat2025compresscache,
  title={Compress \& Cache: Vision token compression for efficient generation and retrieval},
  author={Bulat, Adrian and Ouali, Yassine and Tzimiropoulos, Georgios},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```
2025
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, Georgios Tzimiropoulos

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
PDF Code

Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters.
```
@inproceedings{ntinou2025visionfree,
  title={Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions},
  author={Ntinou, Ioanna and Xenos, Alexandros and Ouali, Yassine and Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025}
}
```
2025
VladVA: Discriminative Fine-tuning of LVLMs

Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
PDF

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
```
@inproceedings{ouali2025vladva,
  title={VladVA: Discriminative Fine-tuning of LVLMs},
  author={Ouali, Yassine and Bulat, Adrian and Xenos, Alexandros and Zaganidis, Anestis and Metaxas, Ioannis Maniadis and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={4101--4111},
  year={2025}
}
```
2025
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Haosen Yang, Adrian Bulat, Isma Hadji, Hai X. Pham, Xiatian Zhu, Georgios Tzimiropoulos, Brais Martinez

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
PDF

Diffusion models are proficient at generating high-quality images. They are, however, effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM Diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.
```
@inproceedings{yang2025famdiffusion,
  title={FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion},
  author={Yang, Haosen and Bulat, Adrian and Hadji, Isma and Pham, Hai X. and Zhu, Xiatian and Tzimiropoulos, Georgios and Martinez, Brais},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}
```
2024
QBB: Quantization with Binary Bases for LLMs

Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos

Advances in Neural Information Processing Systems (NeurIPS), 2024
PDF

Current post-training quantization methods for LLMs compress the weights down to 4 bits, with moderate to low degradation in accuracy. However, further reducing the number of bits or accelerating the network while avoiding large accuracy drops, especially for smaller, sub-7B models, remains an actively researched and open problem. To address this, in this work, we introduce Quantization with Binary Bases (QBB), a new approach for low-bit quantization that effectively removes (nearly) all multiplications, reducing the implementation to summations. Our novel approach works by decomposing the original weights into a set of binary (1-bit) matrices using an iterative process. For a given layer, starting from a weight matrix, we first construct an initial approximation using an analytical solution, where each new binary matrix, paired with a scaling vector, approximates the residual error of the previous estimation. Secondly, using gradient descent and a progressive learning curriculum, we find the optimal set of binary matrices and scaling vectors that minimize the L2 distance between the produced approximation and the original weights. Thirdly, as the previous steps are input-agnostic, we holistically optimize the scaling vectors alone, calibrating them in student-teacher fashion, with the teacher providing both the data, by autoregressive generation starting from a random token, and the target logits. When evaluated across multiple LLM families, our approach matches and outperforms all prior works, setting a new state-of-the-art result using a summation-only based approach.
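The first step described in the abstract, where each new binary matrix and scaling vector approximates the residual of the previous estimate, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common sign-plus-mean-magnitude rule as the analytical solution, uses one scale per row, and omits the gradient-descent refinement and the teacher-student calibration. The function names `binary_bases`, `reconstruct` and `squared_error` are hypothetical.

```python
def sign(x):
    """Map a real number to +1.0 or -1.0 (zero goes to +1.0)."""
    return 1.0 if x >= 0 else -1.0


def binary_bases(weights, num_bases):
    """Greedy residual binarization of a 2-D weight matrix (list of rows).

    Returns a list of (scales, signs) pairs. `signs` has entries in
    {-1.0, +1.0}; `scales` holds one scale per row (an assumption made
    for this sketch).
    """
    residual = [row[:] for row in weights]
    bases = []
    for _ in range(num_bases):
        signs = [[sign(v) for v in row] for row in residual]
        # For fixed signs, the mean absolute value of a row is the
        # least-squares optimal scale for that row.
        scales = [sum(abs(v) for v in row) / len(row) for row in residual]
        bases.append((scales, signs))
        # The next base approximates whatever this one failed to capture.
        residual = [
            [v - a * b for v, b in zip(row, brow)]
            for row, a, brow in zip(residual, scales, signs)
        ]
    return bases


def reconstruct(bases):
    """Sum the scaled binary matrices back into a dense approximation."""
    _, signs0 = bases[0]
    approx = [[0.0] * len(signs0[0]) for _ in signs0]
    for scales, signs in bases:
        for i, (a, brow) in enumerate(zip(scales, signs)):
            for j, b in enumerate(brow):
                approx[i][j] += a * b
    return approx


def squared_error(a, b):
    """Squared L2 distance between two matrices of equal shape."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


W = [[0.5, -1.5, 2.0], [0.1, 0.2, -0.3]]
for k in (1, 2, 3):
    print(k, squared_error(W, reconstruct(binary_bases(W, k))))
```

Under this rule the reconstruction error cannot increase as bases are added, which is why a handful of 1-bit matrices can stand in for a higher-precision weight matrix.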
```
@inproceedings{bulat2024qbb,
  title={QBB: Quantization with Binary Bases for LLMs},
  author={Bulat, Adrian and Ouali, Yassine and Tzimiropoulos, Georgios},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  volume={37},
  year={2024}
}
```
2024
Efficient Vision-Language pre-training via domain-specific learning for human activities

Adrian Bulat, Yassine Ouali, Ricardo Guerrero, Brais Martinez, Georgios Tzimiropoulos

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
PDF

Current Vision-Language (VL) models owe their success to large-scale pre-training on web-collected data, which in turn requires high-capacity architectures and large compute resources for training. We posit that when the downstream tasks are known in advance, which is in practice common, the pretraining process can be aligned to the downstream domain, leading to more efficient and accurate models, while shortening the pretraining step. To this end, we introduce a domain-aligned pretraining strategy that, without additional data collection, improves the accuracy on a domain of interest, herein, that of human activities, while largely preserving the generalist knowledge. At the core of our approach stands a new LLM-based method that, provided with a simple set of concept seeds, produces a concept hierarchy with high coverage of the target domain. The concept hierarchy is used to filter a large-scale web-crawled dataset and, then, enhance the resulting instances with targeted synthetic labels. We study in depth how to train such approaches and their resulting behavior. We further show generalization to video-based data by introducing a fast adaptation approach for transitioning from a static (image) model to a dynamic one (i.e. with temporal modeling). On the domain of interest, our approach significantly outperforms models trained on up to 60x more samples and between 10-100x shorter training schedules for image retrieval, video retrieval and action recognition.
```
@inproceedings{bulat2024efficient,
  title={Efficient Vision-Language pre-training via domain-specific learning for human activities},
  author={Bulat, Adrian and Ouali, Yassine and Guerrero, Ricardo and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={7978--8000},
  year={2024}
}
```
2024
Knowledge Distillation Meets Open-Set Semi-Supervised Learning

Jing Yang, Xiatian Zhu, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

International Journal of Computer Vision (IJCV), 2024
PDF Code

Existing knowledge distillation methods mostly focus on distilling the teacher's predictions and intermediate activations. However, the structured representation, which arguably is one of the most critical ingredients of deep models, is largely overlooked. In this work, we propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student. The key idea is that we leverage the teacher's classifier as a semantic critic for evaluating the representations of both teacher and student and distilling the semantic knowledge with high-order structured information over all feature dimensions. This is accomplished by introducing a notion of cross-network logit, computed by passing the student's representation into the teacher's classifier. Further, considering the set of seen classes as a basis for the semantic space from a combinatorial perspective, we scale the method to unseen classes to enable effective exploitation of largely available, arbitrary unlabeled training data. At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL). Extensive experiments show that our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks, as well as on the less studied yet practically crucial binary network distillation. Under the more realistic open-set SSL settings we introduce, we reveal that knowledge distillation is generally more effective than existing Out-Of-Distribution (OOD) sample detection, and our proposed method is superior to both previous distillation and SSL competitors.
```
@article{yang2024knowledge,
  title={Knowledge Distillation Meets Open-Set Semi-Supervised Learning},
  author={Yang, Jing and Zhu, Xiatian and Bulat, Adrian and Martinez, Brais and Tzimiropoulos, Georgios},
  journal={International Journal of Computer Vision (IJCV)},
  year={2024},
  publisher={Springer}
}
```
2024
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

European Conference on Computer Vision (ECCV), 2024
PDF

Despite recent successes, Large Vision-Language Models (LVLMs) are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant improvements in terms of hallucination reduction over baseline models. We also observe better performance for zero-shot classification, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
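The pipeline in the abstract (rank candidate outputs by CLIP similarity, filter them into positive and negative pairs, then train with DPO) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: the CLIP scores are taken as given numbers, the `min_gap` threshold is a hypothetical stand-in for the paper's rule-based filter, and `dpo_loss` is the standard per-pair DPO objective with an assumed `beta`.

```python
import math


def build_preference_pairs(candidates, min_gap=0.05):
    """Turn scored candidates for one image into (chosen, rejected) pairs.

    candidates: list of (text, clip_score) tuples; the scores are assumed
    to be precomputed image-text similarities.
    min_gap: hypothetical filter that drops pairs whose scores are too close.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    pairs = []
    lo, hi = 0, len(ranked) - 1
    # Pair the best remaining candidate with the worst remaining one.
    while lo < hi:
        (chosen, s_chosen), (rejected, s_rejected) = ranked[lo], ranked[hi]
        if s_chosen - s_rejected >= min_gap:
            pairs.append((chosen, rejected))
        lo += 1
        hi -= 1
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


captions = [
    ("a dog on a sofa", 0.31),
    ("a cat on a table", 0.22),
    ("a dog on a couch", 0.30),
]
print(build_preference_pairs(captions))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log 2 once the policy prefers the chosen output more strongly than the reference model does, which is the signal that pushes the LVLM toward the higher-scoring description.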
```
@inproceedings{ouali2024clip,
  title={CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs},
  author={Ouali, Yassine and Bulat, Adrian and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```
2024
You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

European Conference on Computer Vision (ECCV), 2024
PDF

In this paper, we introduce YONOS-SR, a novel stable diffusion-based approach for image super-resolution that yields state-of-the-art results using only a single DDIM step. We propose a novel scale distillation approach to train our SR model. Instead of directly training our SR model on the scale factor of interest, we start by training a teacher model on a smaller magnification scale, thereby making the SR problem simpler for the teacher. We then train a student model for a higher magnification scale, using the predictions of the teacher as a target during the training. This process is repeated iteratively until we reach the target scale factor of the final model. The rationale behind our scale distillation is that the teacher aids the student diffusion model training by i) providing a target adapted to the current noise level rather than using the same target coming from ground truth data for all noise levels and ii) providing an accurate target as the teacher has a simpler task to solve. We empirically show that the distilled model significantly outperforms the model trained for high scales directly, specifically with few steps during inference. Having a strong diffusion model that requires only one step allows us to freeze the U-Net and fine-tune the decoder on top of it. We show that the combination of spatially distilled U-Net and fine-tuned decoder outperforms state-of-the-art methods requiring 200 steps with only one single step.
```
@inproceedings{noroozi2024you,
  title={You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation},
  author={Noroozi, Mehdi and Hadji, Isma and Martinez, Brais and Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```
2024
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
PDF

Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper, we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we firstly study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. Then, we devise effective solutions for addressing both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with sigmoid loss to address such a requirement. We show very large gains over the current state-of-the-art for both image recognition (+6% on average over 11 datasets) and image retrieval (+19% on Flickr30k and +15% on MSCOCO).
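To make the last step concrete, the sketch below shows how a sigmoid loss accommodates several true positives per image: every image-caption pair is scored independently as a binary decision, so nothing forces a single positive per row as in a softmax-based contrastive loss. It is a minimal illustration, not the paper's training code; the function name, the `scale` and `bias` values, and the toy similarity matrix are all assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def multi_positive_sigmoid_loss(sim, positives, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss over an image-caption similarity matrix.

    sim[i][j]: similarity between image i and caption j.
    positives: set of (i, j) index pairs treated as true matches; an image
    may have several.
    scale, bias: assumed logit temperature and offset for this sketch.
    """
    total = 0.0
    for i, row in enumerate(sim):
        for j, s in enumerate(row):
            label = 1.0 if (i, j) in positives else -1.0
            total -= log_sigmoid(label * (scale * s + bias))
    return total / len(sim)


# Toy example: image 0 has two valid captions (0 and 1), image 1 has one (2).
sim = [
    [0.9, 0.8, 0.1],
    [0.1, 0.2, 0.9],
]
good = multi_positive_sigmoid_loss(sim, {(0, 0), (0, 1), (1, 2)})
bad = multi_positive_sigmoid_loss(sim, {(0, 2), (1, 0)})
print(good, bad)
```

Labeling the high-similarity pairs as positives yields the lower loss, whereas marking them as negatives, as happens with incorrectly assigned negative pairs, is heavily penalized.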
```
@inproceedings{bulat2024fff,
  title={FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models},
  author={Bulat, Adrian and Ouali, Yassine and Tzimiropoulos, Georgios},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```
2023
ReGen: A good Generative zero-shot video classifier should be Rewarded

Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

International Conference on Computer Vision (ICCV), 2023
PDF

This paper sets out to solve the following problem: How can we turn a generative video captioning model into an open-world video/action classification model? Video captioning models can naturally produce open-ended free-form descriptions of a given video which, however, might not be discriminative enough for video/action recognition. Unfortunately, when fine-tuned to auto-regress the class names directly, video captioning models overfit the base classes, losing their open-world zero-shot capabilities. To alleviate base class overfitting, in this work, we propose to use reinforcement learning to enforce the output of the video captioning model to be more class-level discriminative. Specifically, we propose ReGen, a novel reinforcement learning based framework with a three-fold objective and reward functions: (1) a class-level discrimination reward that enforces the generated caption to be correctly classified into the corresponding action class, (2) a CLIP reward that encourages the generated caption to continue to be descriptive of the input video (i.e., video-specific), and (3) a grammar reward that preserves the grammatical correctness of the caption. We show that ReGen can train a model to produce captions that are discriminative, video-specific and grammatically correct. Importantly, when evaluated on standard benchmarks for zero- and few-shot action classification, ReGen significantly outperforms the previous state-of-the-art.
```
@inproceedings{bulat2023regen,
  title={ReGen: A good Generative zero-shot video classifier should be Rewarded},
  author={Bulat, Adrian and Sanchez, Enrique and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13523--13533},
  year={2023}
}
```
2023
Black Box Few-Shot Adaptation for Vision-Language models

Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

International Conference on Computer Vision (ICCV), 2023
PDF Code

Vision-Language (VL) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for VL few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for VL re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and is then iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets.
```
@inproceedings{ouali2023black,
  title={Black Box Few-Shot Adaptation for Vision-Language models},
  author={Ouali, Yassine and Bulat, Adrian and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year={2023}
}
```
2023
FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training

Adrian Bulat, Ricardo Guerrero, Brais Martinez, Georgios Tzimiropoulos

International Conference on Computer Vision (ICCV), 2023
PDF

This paper is on Few-Shot Object Detection (FSOD), where given a few templates (examples) depicting a novel class (not seen during training), the goal is to detect all of its occurrences within a set of images. From a practical perspective, an FSOD system must fulfil the following desiderata: (a) it must be used as is, without requiring any fine-tuning at test time, (b) it must be able to process an arbitrary number of novel objects concurrently while supporting an arbitrary number of examples from each class, and (c) it must achieve accuracy comparable to a closed system. Towards satisfying (a)-(c), in this work, we make the following contributions: We introduce, for the first time, a simple, yet powerful, few-shot detection transformer (FS-DETR) based on visual prompting that can address both desiderata (a) and (b). Our system builds upon the DETR framework, extending it based on two key ideas: (1) feed the provided visual templates of the novel classes as visual prompts during test time, and (2) "stamp" these prompts with pseudo-class embeddings (akin to soft prompting), which are then predicted at the output of the decoder. Importantly, we show that our system is not only more flexible than existing methods, but also makes a step towards satisfying desideratum (c). Specifically, it is significantly more accurate than all methods that do not require fine-tuning and even matches and outperforms the current state-of-the-art fine-tuning based methods on the most well-established benchmarks (PASCAL VOC & MSCOCO).
```
@inproceedings{bulat2023fs,
  title={FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training},
  author={Bulat, Adrian and Guerrero, Ricardo and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year={2023}
}
```
2023
Bayesian Prompt Learning for Image-Language Model Generalization

Mohammad M. Derakhshani, Enrique Sanchez, Adrian Bulat, Victor G Turrisi da Costa, Cees GM Snoek, Georgios Tzimiropoulos, Brais Martinez

International Conference on Computer Vision (ICCV), 2023
PDF Code

Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains.
```
@inproceedings{derakhshani2023bayesian,
  title={Bayesian Prompt Learning for Image-Language Model Generalization},
  author={Derakhshani, Mohammad Mahdi and Sanchez, Enrique and Bulat, Adrian and da Costa, Victor G Turrisi and Snoek, Cees GM and Tzimiropoulos, Georgios and Martinez, Brais},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={15237--15246},
  year={2023}
}
```
2023
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

Adrian Bulat, Georgios Tzimiropoulos

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
PDF Code

Soft prompt learning has recently emerged as one of the methods of choice for adapting V&L models to a downstream task using a few training examples. However, current methods significantly overfit the training data, suffering from large accuracy degradation when tested on unseen classes from the same domain. To this end, in this paper, we make the following 4 contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method by means of a text-to-text cross-entropy loss that maximizes the probability of the learned prompts to be correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we propose grouped LASP where each group of prompts is optimized with respect to a separate subset of textual prompts. (3) We identify a visual-language misalignment introduced by prompt learning and LASP, and more importantly, propose a re-calibration mechanism to address it. (4) We show that LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
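The text-to-text cross-entropy loss in contribution (1) can be illustrated with a minimal sketch. It assumes the learned and hand-crafted prompts have already been encoded into feature vectors (plain lists here), that similarity is cosine similarity, and that logits are scaled by a temperature of 0.07; the function names and the temperature value are illustrative choices, not details taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def text_to_text_loss(learned, handcrafted, temperature=0.07):
    """Mean cross-entropy of classifying learned[c] as class c,
    using the hand-crafted text features as class prototypes."""
    prototypes = [normalize(h) for h in handcrafted]
    losses = []
    for c, feat in enumerate(learned):
        feat = normalize(feat)
        logits = [sum(a * b for a, b in zip(feat, p)) / temperature
                  for p in prototypes]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[c])
    return sum(losses) / len(losses)


# Toy features for two classes (hypothetical values).
handcrafted = [[1.0, 0.0], [0.0, 1.0]]
aligned = text_to_text_loss([[1.0, 0.0], [0.0, 1.0]], handcrafted)
swapped = text_to_text_loss([[0.0, 1.0], [1.0, 0.0]], handcrafted)
```

In this toy setting, learned prompts that match their own class's hand-crafted prompt give a near-zero loss, while swapping the two classes gives a large one.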
```
@inproceedings{bulat2023language,
  title={LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models},
  author={Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}
```
2022
Pre-training strategies and datasets for facial representation learning

Adrian Bulat, Shiyang Cheng, Jing Yang, Andrew Garbett, Enrique Sanchez, Georgios Tzimiropoulos

European Conference on Computer Vision (ECCV), 2022
PDF Code

What is the best way to learn a universal face representation? Recent work on Deep Learning in the area of face analysis has focused on supervised learning for specific tasks of interest (e.g. face recognition, facial landmark localization etc.) but has overlooked the overarching question of how to find a facial representation that can be readily adapted to several facial analysis tasks and datasets. To this end, we make the following 4 contributions: (a) we introduce, for the first time, a comprehensive evaluation benchmark for facial representation learning consisting of 5 important face analysis tasks. (b) We systematically investigate two ways of large-scale representation learning applied to faces: supervised and unsupervised pre-training. Importantly, we focus our evaluations on the case of few-shot facial learning. (c) We investigate important properties of the training datasets including their size and quality (labelled, unlabelled or even uncurated). (d) To draw our conclusions, we conducted a very large number of experiments. Our main two findings are: (1) Unsupervised pre-training on completely in-the-wild, uncurated data provides consistent and, in some cases, significant accuracy improvements for all facial tasks considered. (2) Many existing facial video datasets seem to have a large amount of redundancy. We will release code, pre-trained models and data to facilitate future research.
```
@inproceedings{bulat2022pre,
  title={Pre-training strategies and datasets for facial representation learning},
  author={Bulat, Adrian and Cheng, Shiyang and Yang, Jing and Garbett, Andrew and Sanchez, Enrique and Tzimiropoulos, Georgios},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}
```
2022
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, Brais Martinez

European Conference on Computer Vision (ECCV), 2022
PDF

Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. Despite increasingly stronger variants with ever-higher recognition accuracies, due to the quadratic complexity of self-attention, existing ViTs are typically demanding in computation and model size. Although several successful design choices (e.g., the convolutions and hierarchical multi-stage structure) of prior CNNs have been reintroduced into recent ViTs, they are still not sufficient to meet the limited resource requirements of mobile devices. This motivates a very recent attempt to develop light ViTs based on the state-of-the-art MobileNet-v2, but still leaves a performance gap behind. In this work, pushing further along this under-studied direction we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency. This is realized by introducing a highly cost-effective local-global-local (LGL) information exchange bottleneck based on optimal integration of self-attention and convolutions. For device-dedicated evaluation, rather than relying on inaccurate proxies like the number of FLOPs or parameters, we adopt a practical approach of focusing directly on on-device latency and, for the first time, energy efficiency. Specifically, we show that our models are Pareto-optimal when both accuracy-latency and accuracy-energy trade-offs are considered, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs.
```
@inproceedings{pan2022edgevits,
  title={EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers},
  author={Pan, Junting and Bulat, Adrian and Tan, Fuwen and Zhu, Xiatian and Dudziak, Lukasz and Li, Hongsheng and Tzimiropoulos, Georgios and Martinez, Brais},
  booktitle={European Conference on Computer Vision (ECCV)},
  note={arXiv preprint arXiv:2205.03436},
  year={2022}
}
```
2021
Space-time Mixing Attention for Video Transformer

Adrian Bulat, JM Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

Advances in Neural Information Processing Systems (NeurIPS), 2021
PDF Code

This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have also been shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to jointly attend to spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate 2 very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models.
```
@inproceedings{bulat2021space,
  title={Space-time Mixing Attention for Video Transformer},
  author={Bulat, Adrian and Perez-Rua, Juan-Manuel and Sudhakaran, Swathikiran and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}
```
2021
Bit-Mixer: Mixed-precision networks with runtime bit-width selection

Adrian Bulat, Georgios Tzimiropoulos

International Conference on Computer Vision (ICCV), 2021
PDF

Mixed-precision networks allow for a variable bit-width quantization for every layer in the network. A major limitation of existing work is that the bit-width for each layer must be predefined during training time. This allows little flexibility if the characteristics of the device on which the network is deployed change during runtime. In this work, we propose Bit-Mixer, the very first method to train a meta-quantized network where during test time any layer can change its bit-width without affecting at all the overall network's ability for highly accurate inference. To this end, we make 2 key contributions: (a) Transitional Batch-Norms, and (b) a 3-stage optimization process which is shown capable of training such a network. We show that our method can result in mixed precision networks that exhibit the desirable flexibility properties for on-device deployment without compromising accuracy.
```
@inproceedings{bulat2021bit,
  title={Bit-Mixer: Mixed-precision networks with runtime bit-width selection},
  author={Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={International Conference on Computer Vision},
  year={2021}
}
```
2021
High-Capacity Expert Binary Networks

Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

International Conference on Learning Representations (ICLR), 2021
PDF Code

Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between binary models and their real-valued counterparts remains an unsolved challenging research problem. To this end, we make the following 3 contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at a time conditioned on input features. (b) To increase representation capacity, we propose to address the inherent information bottleneck in binary networks by introducing an efficient width expansion mechanism which keeps the binary operations within the same budget. (c) To improve network design, we propose a principled binary network growth mechanism that unveils a set of network topologies of favorable properties. Overall, our method improves upon prior work, with no increase in computational cost, by ~6%, reaching a groundbreaking ~71% on ImageNet classification.
```
@inproceedings{bulat2021high,
  title={High-Capacity Expert Binary Networks},
  author={Bulat, Adrian and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2021}
}
```
2021
Improving memory banks for unsupervised learning with large mini-batch, consistency and hard negative mining

Adrian Bulat, Enrique Sanchez, Georgios Tzimiropoulos

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
PDF

An important component of unsupervised learning by instance-based discrimination is a memory bank for storing a feature representation for each training sample in the dataset. In this paper, we introduce 3 improvements to the vanilla memory bank-based formulation which bring massive accuracy gains: (a) Large mini-batch: we pull multiple augmentations for each sample within the same batch and show that this leads to better models and enhanced memory bank updates. (b) Consistency: we enforce the logits obtained by different augmentations of the same sample to be close without trying to enforce discrimination with respect to negative samples as proposed by previous approaches. (c) Hard negative mining: since instance discrimination is not meaningful for samples that are too visually similar, we devise a novel nearest neighbour approach for improving the memory bank that gradually merges extremely similar data samples that were previously forced to be apart by the instance level classification loss. Overall, our approach greatly improves the vanilla memory-bank based instance discrimination and outperforms all existing methods for both seen and unseen testing categories with cosine similarity.
```
@inproceedings{bulat2021improving,
  title={Improving memory banks for unsupervised learning with large mini-batch, consistency and hard negative mining},
  author={Bulat, Adrian and Sanchez, Enrique and Tzimiropoulos, Georgios},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021}
}
```
2021
Knowledge Distillation via Softmax Regression Representation Learning

Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

International Conference on Learning Representations (ICLR), 2021
PDF Code

This paper addresses the problem of model compression via knowledge distillation. We advocate for a method that optimizes the output feature of the penultimate layer of the student network and hence is directly related to representation learning. Previous distillation methods which typically impose direct feature matching between the student and the teacher do not take into account the classification problem at hand. On the contrary, our distillation method decouples representation learning and classification and utilizes the teacher's pre-trained classifier to train the student's penultimate layer feature. In particular, for the same input image, we wish the teacher's and student's feature to produce the same output when passed through the teacher's classifier which is achieved with a simple L_2 loss. Our method is extremely simple to implement and straightforward to train and is shown to consistently outperform previous state-of-the-art methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains.
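The loss described above can be sketched in a few lines. This is a minimal illustration that assumes the student and teacher features have the same dimensionality, that the teacher's classifier is a single linear layer, and that the L_2 loss is taken on its raw outputs; the function names and toy values are hypothetical.

```python
def linear(features, weights, bias):
    """Apply a linear classifier: one output per row of `weights`."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax_regression_distill_loss(student_feat, teacher_feat, weights, bias):
    """Mean squared difference between the teacher classifier's outputs
    for the student feature and for the teacher feature."""
    student_out = linear(student_feat, weights, bias)
    teacher_out = linear(teacher_feat, weights, bias)
    diffs = [(s - t) ** 2 for s, t in zip(student_out, teacher_out)]
    return sum(diffs) / len(diffs)


# Hypothetical frozen teacher classifier: 3 classes, 2-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]

loss_same = softmax_regression_distill_loss([0.5, 2.0], [0.5, 2.0], W, b)
loss_diff = softmax_regression_distill_loss([1.0, 2.0], [0.0, 2.0], W, b)
```

Only the student feature would receive gradients in training; the teacher's classifier weights `W` and `b` stay frozen, which is what ties the feature loss to the classification problem.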
```
@inproceedings{yang2021knowledge,
  title={Knowledge Distillation via Softmax Regression Representation Learning},
  author={Yang, Jing and Martinez, Brais and Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2021}
}
```
2021
Subpixel Heatmap Regression for Facial Landmark Localization

Adrian Bulat, Enrique Sanchez, Georgios Tzimiropoulos

British Machine Vision Conference (BMVC), 2021
PDF Code

Deep Learning models based on heatmap regression have revolutionized the task of facial landmark localization with existing models working robustly under large poses, non-uniform illumination and shadows, occlusions and self-occlusions, low resolution and blur. However, despite their wide adoption, heatmap regression approaches suffer from discretization-induced errors related to both the heatmap encoding and decoding process. In this work we show that these errors have a surprisingly large negative impact on facial alignment accuracy. To alleviate this problem, we propose a new approach for the heatmap encoding and decoding process by leveraging the underlying continuous distribution. To take full advantage of the newly proposed encoding-decoding mechanism, we also introduce a Siamese-based training that enforces heatmap consistency across various geometric image transformations. Our approach offers noticeable gains across multiple datasets setting a new state-of-the-art result in facial landmark localization.
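The discretization error mentioned above can be seen in a one-dimensional toy example. The sketch below compares plain argmax decoding with a local expectation around the peak, which is one common way to recover a sub-pixel location; it illustrates the problem and is not the encoding-decoding scheme proposed in the paper. The function names, window radius and Gaussian width are assumptions.

```python
import math


def gaussian_heatmap_1d(center, length, sigma=1.0):
    """Sample a Gaussian centred at a (possibly fractional) location."""
    return [math.exp(-((x - center) ** 2) / (2 * sigma ** 2))
            for x in range(length)]


def decode_argmax(heatmap):
    """Integer location of the maximum response."""
    return max(range(len(heatmap)), key=heatmap.__getitem__)


def decode_local_expectation(heatmap, radius=2):
    """Response-weighted mean position in a window around the peak."""
    peak = decode_argmax(heatmap)
    lo = max(0, peak - radius)
    hi = min(len(heatmap) - 1, peak + radius)
    total = sum(heatmap[x] for x in range(lo, hi + 1))
    return sum(x * heatmap[x] for x in range(lo, hi + 1)) / total


true_center = 3.3
heatmap = gaussian_heatmap_1d(true_center, length=8)
coarse = decode_argmax(heatmap)              # snaps to the pixel grid
refined = decode_local_expectation(heatmap)  # fractional estimate
```

Argmax decoding can be off by up to half a pixel at heatmap resolution, and that error is multiplied when coordinates are mapped back to the full-resolution image.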
```
@inproceedings{bulat2021subpixel,
  title={Subpixel Heatmap Regression for Facial Landmark Localization},
  author={Bulat, Adrian and Sanchez, Enrique and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
  year={2021}
}
```

Estimation of continuous valence and arousal levels from faces in naturalistic conditions

Antoine Toisoul, Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, Maja Pantic

Nature Machine Intelligence, 2021

PDF

```
@article{toisoul2021estimation,
  author  = {Antoine Toisoul and Jean Kossaifi and Adrian Bulat and Georgios Tzimiropoulos and Maja Pantic},
  title   = {Estimation of continuous valence and arousal levels from faces in naturalistic conditions},
  journal = {Nature Machine Intelligence},
  year    = {2021},
  url     = {https://www.nature.com/articles/s42256-020-00280-0}
}
```

2020
BATS: Binary ArchitecTure Search

Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

European Conference on Computer Vision (ECCV), 2020
PDF

This paper proposes Binary ArchitecTure Search (BATS), a framework that drastically reduces the accuracy gap between binary neural networks and their real-valued counterparts by means of Neural Architecture Search (NAS). We show that directly applying NAS to the binary domain provides very poor results. To alleviate this, we describe, to our knowledge, for the first time, the 3 key ingredients for successfully applying NAS to the binary domain. Specifically, we (1) introduce and design a novel binary-oriented search space, (2) propose a new mechanism for controlling and stabilising the resulting searched topologies, (3) propose and validate a series of new search strategies for binary networks that lead to faster convergence and lower search times. Experimental results demonstrate the effectiveness of the proposed approach and the necessity of searching in the binary space directly. Moreover, (4) we set a new state-of-the-art for binary neural networks on CIFAR10, CIFAR100 and ImageNet datasets.
```
@inproceedings{bulat2020bats,
  title={BATS: Binary ArchitecTure Search},
  author={Bulat, Adrian and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2020}
}
```
2020
Training binary neural networks with real-to-binary convolutions

Brais Martinez, Jing Yang*, Adrian Bulat*, Georgios Tzimiropoulos

International Conference on Learning Representations (ICLR), 2020
PDF

This paper shows how to train binary networks to within a few percent points (~3-5%) of the full precision counterpart. We first show how to build a strong baseline, which already achieves state-of-the-art accuracy, by combining recently proposed advances and carefully adjusting the optimization procedure. Secondly, we show that by attempting to minimize the discrepancy between the output of the binary and the corresponding real-valued convolution, additional significant accuracy gains can be obtained. We materialize this idea in two complementary ways: (1) with a loss function, during training, by matching the spatial attention maps computed at the output of the binary and real-valued convolutions, and (2) in a data-driven manner, by using the real-valued activations, available during inference prior to the binarization process, for re-scaling the activations right after the binary convolution. Finally, we show that, when putting all of our improvements together, the proposed model beats the current state of the art by more than 5% top-1 accuracy on ImageNet and reduces the gap to its real-valued counterpart to less than 3% and 5% top-1 accuracy on CIFAR-100 and ImageNet respectively when using a ResNet-18 architecture.
```
@inproceedings{martinez2020training,
  title={Training binary neural networks with real-to-binary convolutions},
  author={Martinez, Brais and Yang, Jing and Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}
```
2020
Toward fast and accurate human pose estimation via soft-gated skip connections

Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, Maja Pantic

IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2020 (ORAL)
PDF

This paper is on highly accurate and highly efficient human pose estimation. Recent works based on Fully Convolutional Networks (FCNs) have demonstrated excellent results for this difficult problem. While residual connections within FCNs have proved to be quintessential for achieving high accuracy, we re-analyze this design choice in the context of improving both the accuracy and the efficiency over the state-of-the-art. In particular, we make the following contributions: (a) We propose gated skip connections with per-channel learnable parameters to control the data flow for each channel within the module within the macro-module. (b) We introduce a hybrid network that combines the HourGlass and U-Net architectures which minimizes the number of identity connections within the network and increases the performance for the same parameter budget. Our model achieves state-of-the-art results on the MPII and LSP datasets. In addition, with a reduction of 3x in model size and complexity, we show no decrease in performance when compared to the original HourGlass network.
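Contribution (a) can be pictured as a residual block whose identity path is scaled by one learnable value per channel. The sketch below assumes the form `out = alpha * x + F(x)`; the exact formulation, the function names and the toy branch are illustrative assumptions rather than the paper's implementation.

```python
def soft_gated_residual(x, branch, alpha):
    """Residual block with a per-channel gate on the skip path.

    x:      list of channels, each a list of activations
    branch: function computing F(x) with the same shape as x
    alpha:  one gate value per channel (learned in a real network)
    """
    fx = branch(x)
    return [[a * xi + fi for xi, fi in zip(x_ch, f_ch)]
            for a, x_ch, f_ch in zip(alpha, x, fx)]


def toy_branch(x):
    """Stand-in for the convolutional branch: halves every activation."""
    return [[0.5 * v for v in channel] for channel in x]


x = [[1.0, 2.0], [3.0, 4.0]]
alpha = [1.0, 0.0]  # channel 0 keeps its skip path, channel 1 has it switched off
out = soft_gated_residual(x, toy_branch, alpha)
```

With `alpha` fixed to 1 for every channel this reduces to a standard residual connection, so the gate only adds one parameter per channel.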
```
@inproceedings{bulat2020toward,
  title={Toward fast and accurate human pose estimation via soft-gated skip connections},
  author={Bulat, Adrian and Kossaifi, Jean and Tzimiropoulos, Georgios and Pantic, Maja},
  booktitle={2020 15th IEEE International Conference on Automatic Face \& Gesture Recognition},
  year={2020}
}
```
2020
Semi-supervised AU Intensity Estimation with Contrastive Learning

Enrique Sanchez, Adrian Bulat, Anestis Zaganidis, Georgios Tzimiropoulos

Asian Conference on Computer Vision (ACCV), 2020
PDF

This paper tackles the challenging problem of estimating the intensity of Facial Action Units with few labeled images. Contrary to previous works, our method does not require manually selecting key frames, and produces state-of-the-art results with as little as 2% of annotated frames, which are randomly chosen. To this end, we propose a semi-supervised learning approach where a spatio-temporal model combining a feature extractor and a temporal module is learned in two stages. The first stage uses datasets of unlabeled videos to learn a strong spatio-temporal representation of facial behavior dynamics based on contrastive learning. To our knowledge we are the first to build upon this framework for modeling facial behavior in an unsupervised manner. The second stage uses another dataset of randomly chosen labeled frames to train a regressor on top of our spatio-temporal model for estimating the AU intensity. We show that although backpropagation through time is applied only with respect to the output of the network for extremely sparse and randomly chosen labeled frames, our model can be effectively trained to estimate AU intensity accurately, thanks to the unsupervised pre-training of the first stage. We experimentally validate that our method outperforms existing methods when working with as little as 2% of randomly chosen data for both DISFA and BP4D datasets, without a careful choice of labeled frames, a time-consuming task still required in previous approaches.
```
@inproceedings{sanchez2020semi,
  title={Semi-supervised AU Intensity Estimation with Contrastive Learning},
  author={Sanchez, Enrique and Bulat, Adrian and Zaganidis, Anestis and Tzimiropoulos, Georgios},
  booktitle={Proceedings of the Asian Conference on Computer Vision (ACCV)},
  year={2020}
}
```
2020
Incremental multi-domain learning with network latent tensor factorization

Adrian Bulat*, Jean Kossaifi*, Georgios Tzimiropoulos, Maja Pantic

AAAI Conference on Artificial Intelligence, 2020
PDF

The prominence of deep learning, large amounts of annotated data and increasingly powerful hardware have made it possible to reach remarkable performance for supervised classification tasks, in many cases saturating the training sets. However, the resulting models are specialized to a single very specific task and domain. Adapting the learned classification to new domains is a hard problem due to at least three reasons: (1) the new domains and the tasks might be drastically different; (2) there might be a very limited amount of annotated data on the new domain and (3) full training of a new model for each new task is prohibitive in terms of computation and memory, due to the sheer number of parameters of deep CNNs. In this paper, we present a method to learn new domains and tasks incrementally, building on prior knowledge from already learned tasks and without catastrophic forgetting. We do so by jointly parametrizing weights across layers using low-rank Tucker structure. The core is task agnostic while a set of task specific factors are learnt on each new domain. We show that leveraging tensor structure enables better performance than simply using matrix operations. Joint tensor modelling also naturally leverages correlations across different layers. Compared with previous methods which have focused on adapting each layer separately, our approach results in more compact representations for each new task/domain. We apply the proposed method to the 10 datasets of the Visual Decathlon Challenge and show that our method offers on average about 7.5x reduction in number of parameters and competitive performance in terms of both classification accuracy and Decathlon score.
```
@inproceedings{bulat2020incremental,
  title={Incremental multi-domain learning with network latent tensor factorization},
  author={Bulat, Adrian and Kossaifi, Jean and Tzimiropoulos, Georgios and Pantic, Maja},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2020}
}
```
2020
FAN-Face: a Simple Orthogonal Improvement to Deep Face Recognition

Jing Yang, Adrian Bulat, Georgios Tzimiropoulos

AAAI Conference on Artificial Intelligence, 2020
PDF

It is known that facial landmarks provide pose, expression and shape information. In addition, when matching, for example, a profile and/or expressive face to a frontal one, knowledge of these landmarks is useful for establishing correspondence which can help improve recognition. However, in prior work on face recognition, facial landmarks are only used for face cropping in order to remove scale, rotation and translation variations. This paper proposes a simple approach to face recognition which gradually integrates features from different layers of a facial landmark localization network into different layers of the recognition network. To this end, we propose an appropriate feature integration layer which makes the features compatible before integration. We show that such a simple approach systematically improves recognition on the most difficult face recognition datasets, setting a new state-of-the-art on IJB-B, IJB-C and MegaFace datasets.
```
@inproceedings{yang2020fan,
  title={FAN-Face: a Simple Orthogonal Improvement to Deep Face Recognition},
  author={Yang, Jing and Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2020}
}
```
2020
Factorized Higher-Order CNNs with an Application to Spatio-Temporal Emotion Estimation

Jean Kossaifi*, Antoine Toisoul*, Adrian Bulat, Yannis Panagakis, Timothy M Hospedales, Maja Pantic

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
PDF

Training deep neural networks with spatio-temporal (i.e., 3D) or multidimensional convolutions of higher-order is computationally challenging due to millions of unknown parameters across dozens of layers. To alleviate this, one approach is to apply low-rank tensor decompositions to convolution kernels in order to compress the network and reduce its number of parameters. Alternatively, new convolutional blocks, such as MobileNet, can be directly designed for efficiency. In this paper, we unify these two approaches by proposing a tensor factorization framework for efficient multidimensional (separable) convolutions of higher-order. Interestingly, the proposed framework enables a novel higher-order transduction, allowing to train a network on a given domain (e.g., 2D images or N-dimensional data in general) and using transduction to generalize to higher-order data such as videos (or (N+K)-dimensional data in general), capturing for instance temporal dynamics while preserving the learnt spatial information. We apply the proposed methodology, coined CP-Higher-Order Convolution (HO-CPConv), to spatio-temporal facial emotion analysis. Most existing facial affect models focus on static imagery and discard all temporal information. This is due to the above-mentioned burden of training 3D convolutional nets and the lack of large bodies of video data annotated by experts. We address both issues with our proposed framework. Initial training is first done on static imagery before using transduction to generalize to the temporal domain. We demonstrate superior performance on three challenging large scale affect estimation datasets, AffectNet, SEWA, and AFEW-VA.
```
@inproceedings{kossaifi2020factorized,
  title={Factorized Higher-Order CNNs with an Application to Spatio-Temporal Emotion Estimation},
  author={Kossaifi, Jean and Toisoul, Antoine and Bulat, Adrian and Panagakis, Yannis and Hospedales, Timothy M and Pantic, Maja},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={6060--6069},
  year={2020}
}
```
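As a rough illustration of why a CP (rank-1 sum) factorization of a convolution kernel saves parameters, the sketch below counts parameters for a dense kernel versus one factor matrix per mode, and rebuilds a single kernel entry from the factors. The sizes, the rank, and the `add_mode` helper (which appends an extra factor, for example for time) are assumptions made for this example and are not taken from the paper.

```python
from math import prod


def dense_kernel_params(dims):
    """Parameter count of an unfactorized kernel with the given mode sizes."""
    return prod(dims)


def cp_kernel_params(dims, rank):
    """Parameter count of a CP factorization: one (size x rank) factor per mode."""
    return rank * sum(dims)


def cp_entry(factors, index):
    """Rebuild one kernel entry: sum over r of the product of factors[k][index[k]][r]."""
    rank = len(factors[0][0])
    return sum(
        prod(factor[i][r] for factor, i in zip(factors, index))
        for r in range(rank)
    )


def add_mode(factors, size, fill=1.0):
    """Hypothetical helper: append a new factor (e.g. time) of shape size x rank."""
    rank = len(factors[0][0])
    return factors + [[[fill] * rank for _ in range(size)]]


# Illustrative sizes only: (out channels, in channels, time, height, width).
dims_3d = (64, 64, 3, 3, 3)
print(dense_kernel_params(dims_3d))        # 110592
print(cp_kernel_params(dims_3d, rank=16))  # 2192
```

In this toy setting, going from a lower-order to a higher-order kernel only adds one small factor, which is the intuition behind reusing spatial factors learnt on static images.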
2019
XNOR-Net++: Improved binary neural networks

Adrian Bulat, Georgios Tzimiropoulos

British Machine Vision Conference (BMVC), 2019
XNOR-Net++: Improved binary neural networks 2019
PDF Code

This paper proposes an improved training algorithm for binary neural networks in which both weights and activations are binary numbers. A key but fairly overlooked feature of the current state-of-the-art method of XNOR-Net is the use of analytically calculated real-valued scaling factors for re-weighting the output of binary convolutions. We argue that analytic calculation of these factors is sub-optimal. Instead, in this work, we make the following contributions: (a) we propose to fuse the activation and weight scaling factors into a single one that is learned discriminatively via backpropagation. (b) More importantly, we explore several ways of constructing the shape of the scale factors while keeping the computational budget fixed. (c) We empirically measure the accuracy of our approximations and show that they are significantly more accurate than the analytically calculated one. (d) We show that our approach significantly outperforms XNOR-Net within the same computational budget when tested on the challenging task of ImageNet classification, offering up to 6% accuracy gain.
```
@inproceedings{bulat2019xnor-net-plus,
  title={XNOR-Net++: Improved binary neural networks},
  author={Bulat, Adrian and Tzimiropoulos, Georgios},
  booktitle={British Machine Vision Conference},
  year={2019}
}
```
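A minimal sketch of the contrast drawn in the abstract: an analytically computed scale (here the mean absolute weight, as in XNOR-Net) versus a single fused scale treated as an ordinary parameter and updated by gradient descent. The squared-error objective, the learning rate, and all function names are assumptions for illustration; this is not the authors' training code.

```python
def sign(x):
    """Binarize a real value to -1 or +1 (zero maps to +1)."""
    return 1 if x >= 0 else -1


def binary_dot(weights, activations):
    """Dot product computed on binarized weights and activations."""
    return sum(sign(w) * sign(a) for w, a in zip(weights, activations))


def analytic_weight_scale(weights):
    """Analytic scale: mean absolute value of the real-valued weights."""
    return sum(abs(w) for w in weights) / len(weights)


def sgd_step_on_scale(gamma, weights, activations, lr=0.1):
    """One gradient step on a fused scale gamma (illustrative objective).

    Loss is the squared error between gamma * binary_dot(...) and the
    real-valued dot product; only gamma is updated here.
    """
    b = binary_dot(weights, activations)
    target = sum(w * a for w, a in zip(weights, activations))
    grad = 2.0 * (gamma * b - target) * b
    return gamma - lr * grad


weights = [0.5, -0.2]
activations = [1.0, -3.0]
gamma = sgd_step_on_scale(1.0, weights, activations)
print(binary_dot(weights, activations), gamma)
```

In a real network the scale would be trained jointly with the weights on the task loss; the single step above only shows that the scale can be learned rather than computed in closed form.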
2019
T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor

Jean Kossaifi*, Adrian Bulat*, Georgios Tzimiropoulos, Maja Pantic

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor 2019
PDF

Recent findings indicate that over-parametrization, while crucial for successfully training deep neural networks, also introduces large amounts of redundancy. Tensor methods have the potential to efficiently parametrize over-complete representations by leveraging this redundancy. In this paper, we propose to fully parametrize Convolutional Neural Networks (CNNs) with a single high-order, low-rank tensor. Previous works on network tensorization have focused on parametrizing individual layers (convolutional or fully connected) only, and perform the tensorization layer-by-layer separately. In contrast, we propose to jointly capture the full structure of a neural network by parametrizing it with a single high-order tensor, the modes of which represent each of the architectural design parameters of the network (e.g. number of convolutional blocks, depth, number of stacks, input features, etc.). This parametrization makes it possible to regularize the whole network and drastically reduce the number of parameters. Our model is end-to-end trainable and the low-rank structure imposed on the weight tensor acts as an implicit regularization. We study the case of networks with rich structure, namely Fully Convolutional Networks (FCNs), which we propose to parametrize with a single 8th-order tensor. We show that our approach can achieve superior performance with small compression rates, and attain high compression rates with negligible drop in accuracy for the challenging task of human pose estimation.
```
@article{kossaifi2019t,
  title={T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor},
  author={Kossaifi, Jean and Bulat, Adrian and Tzimiropoulos, Georgios and Pantic, Maja},
  journal={arXiv preprint arXiv:1904.02698},
  year={2019}
}
```
2018
To learn image super-resolution, use a GAN to learn how to do image degradation first

Adrian Bulat*, Jing Yang*, Georgios Tzimiropoulos

European Conference on Computer Vision (ECCV), 2018
To learn image super-resolution, use a GAN to learn how to do image degradation first 2018
PDF Code

This paper is on image and face super-resolution. The vast majority of prior work on this problem focuses on how to increase the resolution of low-resolution images which are artificially generated by simple bilinear down-sampling (or in a few cases by blurring followed by down-sampling). We show that such methods fail to produce good results when applied to real-world low-resolution, low-quality images. To circumvent this problem, we propose a two-stage process which first trains a High-to-Low Generative Adversarial Network (GAN) to learn how to degrade and downsample high-resolution images, requiring, during training, only unpaired high- and low-resolution images. Once this is achieved, the output of this network is used to train a Low-to-High GAN for image super-resolution, this time using paired low- and high-resolution images. Our main result is that this network can now be used to effectively increase the quality of real-world low-resolution images. We have applied the proposed pipeline to the problem of face super-resolution, where we report large improvement over baselines and prior work, although the proposed method is potentially applicable to other object categories.
```
@inproceedings{bulatyang2018learn,
     title={To learn image super-resolution, use a GAN to learn how to do image degradation first},
     author={Bulat, Adrian and Yang, Jing and Tzimiropoulos, Georgios},
     booktitle={European Conference on Computer Vision},
     year={2018}
 }
```
2018
Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs

Adrian Bulat, Georgios Tzimiropoulos

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 (SPOTLIGHT)
Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs 2018
PDF

This paper addresses 2 challenging tasks: improving the quality of low resolution facial images and accurately locating the facial landmarks on such poor resolution images. To this end, we make the following 5 contributions: (a) we propose Super-FAN: the very first end-to-end system that addresses both tasks simultaneously, i.e. both improves face resolution and detects the facial landmarks. The novelty of Super-FAN lies in incorporating structural information in a GAN-based super-resolution algorithm via integrating a sub-network for face alignment through heatmap regression and optimizing a novel heatmap loss. (b) We illustrate the benefit of training the two networks jointly by reporting good results not only on frontal images (as in prior work) but on the whole spectrum of facial poses, and not only on synthetic low resolution images (as in prior work) but also on real-world images. (c) We improve upon the state-of-the-art in face super-resolution by proposing a new residual-based architecture. (d) Quantitatively, we show large improvement over the state-of-the-art for both face super-resolution and alignment. (e) Qualitatively, we show for the first time good results on real-world low resolution images.
```
@article{bulat2017super,
    title={Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs},
    author={Bulat, Adrian and Tzimiropoulos, Georgios},
    journal={arXiv preprint arXiv:1712.02765},
    year={2017}
}
```
2018
Hierarchical binary CNNs for landmark localization with limited resources

Adrian Bulat, Georgios Tzimiropoulos

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018 - Best of ICCV17 SI
Hierarchical binary CNNs for landmark localization with limited resources 2018
PDF Code

Our goal is to design architectures that retain the groundbreaking performance of Convolutional Neural Networks (CNNs) for landmark localization and at the same time are lightweight, compact and suitable for applications with limited computational resources. To this end, we make the following contributions: (a) we are the first to study the effect of neural network binarization on localization tasks, namely human pose estimation and face alignment. We exhaustively evaluate various design choices, identify performance bottlenecks, and more importantly propose multiple orthogonal ways to boost performance. (b) Based on our analysis, we propose a novel hierarchical, parallel and multi-scale residual architecture that yields large performance improvement over the standard bottleneck block while having the same number of parameters, thus bridging the gap between the original network and its binarized counterpart. (c) We perform a large number of ablation studies that shed light on the properties and the performance of the proposed block. (d) We present results for experiments on the most challenging datasets for human pose estimation and face alignment, reporting in many cases state-of-the-art performance. (e) We further provide additional results for the problem of facial part segmentation.
```
@article{bulat2018hierarchical,
     title={Hierarchical binary CNNs for landmark localization with limited resources},
     author={Bulat, Adrian and Tzimiropoulos, Georgios},
     journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
     year={2018}
 }
```
2017
Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression

Aaron S. Jackson, Adrian Bulat, Vasileios Argyriou, Georgios Tzimiropoulos

International Conference on Computer Vision (ICCV), 2017
Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression 2017
PDF

3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the availability of multiple facial images (sometimes from the same subject) as input, and must address a number of methodological challenges such as establishing dense correspondences across large facial poses, expressions, and non-uniform illumination. In general these methods require complex and inefficient pipelines for model building and fitting. In this work, we propose to address many of these limitations by training a Convolutional Neural Network (CNN) on an appropriate dataset consisting of 2D images and 3D facial models or scans. Our CNN works with just a single 2D facial image, does not require accurate alignment nor establishes dense correspondence between images, works for arbitrary facial poses and expressions, and can be used to reconstruct the whole 3D facial geometry (including the non-visible parts of the face) bypassing the construction (during training) and fitting (during testing) of a 3D Morphable Model. We achieve this via a simple CNN architecture that performs direct regression of a volumetric representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related task of facial landmark localization can be incorporated into the proposed framework and help improve reconstruction quality, especially for the cases of large poses and facial expressions.
```
@inproceedings{jackson2017large,
    title={Large pose 3D face reconstruction from a single image via direct volumetric CNN regression},
    author={Jackson, Aaron S and Bulat, Adrian and Argyriou, Vasileios and Tzimiropoulos, Georgios},
    booktitle={Computer Vision (ICCV), 2017 IEEE International Conference on},
    pages={1031--1039},
    year={2017},
    organization={IEEE}
}
```
2017
How far are we from solving the 2D & 3D Face Alignment problem? (and a dataset of 230,000 3D facial landmarks)

Adrian Bulat, Georgios Tzimiropoulos

International Conference on Computer Vision (ICCV), 2017
How far are we from solving the 2D & 3D Face Alignment problem? (and a dataset of 230,000 3D facial landmarks) 2017
PDF Code

This paper investigates how far a very deep neural network is from attaining close to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following 5 contributions: (a) we construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, train it on a very large yet synthetically expanded 2D facial landmark dataset and finally evaluate it on all other 2D facial landmark datasets. (b) We create a network, guided by 2D landmarks, which converts 2D landmark annotations to 3D and unifies all existing datasets, leading to the creation of LS3D-W, the largest and most challenging 3D facial landmark dataset to date (~230,000 images). (c) Following that, we train a neural network for 3D face alignment and evaluate it on the newly introduced LS3D-W. (d) We further look into the effect of all "traditional" factors affecting face alignment performance like large pose, initialization and resolution, and introduce a "new" one, namely the size of the network. (e) We show that both 2D and 3D face alignment networks achieve performance of remarkable accuracy which is probably close to saturating the datasets used.
```
@inproceedings{bulat2017far,
    title={How far are we from solving the 2d \& 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)},
    author={Bulat, Adrian and Tzimiropoulos, Georgios},
    booktitle={International Conference on Computer Vision},
    volume={1},
    number={6},
    pages={8},
    year={2017}
}
```
2017
Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources

Adrian Bulat, Georgios Tzimiropoulos

International Conference on Computer Vision (ICCV), 2017 (ORAL)
Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources 2017
PDF Code

Our goal is to design architectures that retain the groundbreaking performance of CNNs for landmark localization and at the same time are lightweight, compact and suitable for applications with limited computational resources. To this end, we make the following contributions: (a) we are the first to study the effect of neural network binarization on localization tasks, namely human pose estimation and face alignment. We exhaustively evaluate various design choices, identify performance bottlenecks, and more importantly propose multiple orthogonal ways to boost performance. (b) Based on our analysis, we propose a novel hierarchical, parallel and multi-scale residual architecture that yields large performance improvement over the standard bottleneck block while having the same number of parameters, thus bridging the gap between the original network and its binarized counterpart. (c) We perform a large number of ablation studies that shed light on the properties and the performance of the proposed block. (d) We present results for experiments on the most challenging datasets for human pose estimation and face alignment, reporting in many cases state-of-the-art performance.
```
@inproceedings{bulat2017binarized,
    title={Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources},
    author={Bulat, Adrian and Tzimiropoulos, Georgios},
    booktitle={The IEEE International Conference on Computer Vision (ICCV)},
    volume={1},
    number={2},
    pages={4},
    year={2017}
}
```
2016
Two-stage convolutional part heatmap regression for the 1st 3d face alignment in the wild (3DFAW) challenge

Adrian Bulat, Georgios Tzimiropoulos

European Conference on Computer Vision Workshop (ECCV-W), 2016 (Challenge Winners)
Two-stage convolutional part heatmap regression for the 1st 3d face alignment in the wild (3DFAW) challenge 2016
PDF

This paper describes our submission to the 1st 3D Face Alignment in the Wild (3DFAW) Challenge. Our method builds upon the idea of convolutional part heatmap regression [1], extending it for 3D face alignment. Our method decomposes the problem into two parts: (a) X,Y (2D) estimation and (b) Z (depth) estimation. At the first stage, our method estimates the X,Y coordinates of the facial landmarks by producing a set of 2D heatmaps, one for each landmark, using convolutional part heatmap regression. Then, these heatmaps, alongside the input RGB image, are used as input to a very deep subnetwork trained via residual learning for regressing the Z coordinate. Our method ranked 1st in the 3DFAW Challenge, surpassing the second best result by more than 22%.
```
@inproceedings{bulat2016two,
    title={Two-stage convolutional part heatmap regression for the 1st 3d face alignment in the wild (3dfaw) challenge},
    author={Bulat, Adrian and Tzimiropoulos, Georgios},
    booktitle={European Conference on Computer Vision},
    pages={616--624},
    year={2016},
    organization={Springer}
}
```
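To make the first stage concrete, the sketch below reads an (X, Y) position off each landmark heatmap by taking the location of its maximum. Argmax decoding is a common convention and is assumed here for illustration; the abstract does not specify how coordinates are extracted, and the function names and toy heatmaps are hypothetical.

```python
def heatmap_argmax(heatmap):
    """Return (x, y) of the highest-scoring cell in a 2D heatmap (list of rows)."""
    best_x, best_y = 0, 0
    best_val = heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_x, best_y, best_val = x, y, value
    return best_x, best_y


def decode_landmarks(heatmaps):
    """Decode one (x, y) pair per landmark from a list of 2D heatmaps."""
    return [heatmap_argmax(h) for h in heatmaps]


# Two toy 2x2 heatmaps, one per landmark.
toy_heatmaps = [
    [[0.1, 0.2],
     [0.9, 0.3]],
    [[0.0, 0.7],
     [0.1, 0.2]],
]
print(decode_landmarks(toy_heatmaps))  # [(0, 1), (1, 0)]
```

In the described method the heatmaps themselves, together with the RGB image, are passed on to the second-stage network that regresses Z; the decoding above only covers the 2D part.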
2016
Human pose estimation via convolutional part heatmap regression

Adrian Bulat, Georgios Tzimiropoulos

European Conference on Computer Vision (ECCV), 2016
Human pose estimation via convolutional part heatmap regression 2016
PDF Code

This paper is on human pose estimation using Convolutional Neural Networks. Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detection heatmaps and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: It guides the network where to focus in the image and effectively encodes part constraints and context. More importantly, it can effectively cope with occlusions because part detection heatmaps for occluded parts provide low confidence scores which subsequently guide the regression part of our network to rely on contextual information in order to predict the location of these parts. Additionally, we show that the proposed cascade is flexible enough to readily allow the integration of various CNN architectures for both detection and regression, including recent ones based on residual learning. Finally, we illustrate that our cascade achieves top performance on the MPII and LSP data sets.
```
@inproceedings{bulat2016human,
    title={Human pose estimation via convolutional part heatmap regression},
    author={Bulat, Adrian and Tzimiropoulos, Georgios},
    booktitle={European Conference on Computer Vision},
    pages={717--732},
    year={2016},
    organization={Springer}
}
```
2016
Convolutional aggregation of local evidence for large pose face alignment

Adrian Bulat, Georgios Tzimiropoulos

British Machine Vision Conference (BMVC), 2016
Convolutional aggregation of local evidence for large pose face alignment 2016
PDF

Methods for unconstrained face alignment must satisfy two requirements: they must not rely on accurate initialisation/face detection and they should perform equally well for the whole spectrum of facial poses. To the best of our knowledge, there are no methods meeting these requirements to satisfactory extent, and in this paper, we propose Convolutional Aggregation of Local Evidence (CALE), a Convolutional Neural Network (CNN) architecture particularly designed for addressing both of them. In particular, to remove the requirement for accurate face detection, our system firstly performs facial part detection, providing confidence scores for the location of each of the facial landmarks (local evidence). Next, these score maps along with early CNN features are aggregated by our system through joint regression in order to refine the landmarks' location. Besides playing the role of a graphical model, CNN regression is a key feature of our system, guiding the network to rely on context for predicting the location of occluded landmarks, typically encountered in very large poses. The whole system is trained end-to-end with intermediate supervision. When applied to AFLW-PIFA, the most challenging human face alignment test set to date, our method provides more than 50% gain in localisation accuracy when compared to other recently published methods for large pose face alignment. Going beyond human faces, we also demonstrate that CALE is effective in dealing with very large changes in shape and appearance, typically encountered in animal faces.
```
@inproceedings{bulat2016convolutional,
    title={Convolutional aggregation of local evidence for large pose face alignment},
    author={Bulat, Adrian and Tzimiropoulos, Georgios},
    booktitle={British Machine Vision Conference},
    year={2016}
}
```
