This paper provides an overview of the recent advancements in foundation models and discusses potential applications of these models in the field of biometrics. Foundation models, such as Large Language Models (LLMs), Vision Language Models (VLMs), Audio-Language Models (ALMs), and Large Multi-modal Models (LMMs), are based on large neural networks that are trained on massive amounts of data and enable robust feature extraction for transfer learning. These models allow efficient zero-shot and few-shot learning, achieving state-of-the-art performance in downstream tasks. Biometrics is also an active field of research, which involves various research problems, ranging from robust recognition to security and privacy in biometric systems. In this paper, we present an in-depth analysis of state-of-the-art methodologies regarding foundation multi-modal models, their advancements, and their applicability to biometrics tasks. We also highlight current limitations and provide insights into potential future research directions in the applications of foundation models in biometrics. To our knowledge, this paper is the first survey that investigates the applications of foundation models in biometrics.
[ Link to pre-print ] [ Link to paper on IEEE-Xplore (Open Access) ]
The survey is structured as follows for clarity and readability:
- Foundation Models: In this section, we review recent advancements in foundation models and mention state-of-the-art models. We categorise foundation models into four categories:
  - Large Language Models (LLMs)
  - Vision Language Models (VLMs)
  - Audio-Language Models (ALMs)
  - Large Multi-modal Models (LMMs)
- Biometric Recognition and Security: In this section, we review the general pipeline of biometric systems. We describe attack points in biometric systems and discuss security and privacy threats. For more information about this section, we refer readers to Section III of our survey paper.
- Applications of Foundation Models in Biometrics: In this section, we review recent papers on the applications of foundation models in biometrics:
  - Foundation Models for Biometric Recognition
  - Foundation Models for Soft-biometric Detection
  - Foundation Models for Deepfake and Forgery Detection
  - Foundation Models for Anti-spoofing
  - Foundation Models for Synthetic Biometric Generation
In this section, we review recent advancements in foundation models and mention state-of-the-art models. We categorise foundation models into four categories:
- Large Language Models (LLMs)
- Vision Language Models (VLMs)
- Audio-Language Models (ALMs)
- Large Multi-modal Models (LMMs)

Large Language Models (LLMs):

| Model | Paper Title | Year | Paper | Code |
|---|---|---|---|---|
| GPT | Improving language understanding by generative pre-training | 2018 | link | link |
| GPT-2 | Language models are unsupervised multitask learners | 2019 | link | link |
| GPT-3 | Language models are few-shot learners | 2020 | link | link |
| GPT-4 | GPT-4 technical report | 2023 | link | NA |
| o1 | OpenAI o1 System Card | 2024 | link | NA |
| o3-mini | OpenAI o3-mini System Card | 2025 | link | NA |
| BERT | BERT: Pretraining of deep bidirectional transformers for language understanding | 2018 | link | link |
| T5 | Exploring the limits of transfer learning with a unified text-to-text transformer | 2020 | link | link |
| FLAN-T5 | Scaling instruction-finetuned language models | 2024 | link | link |
| OPT | OPT: Open pre-trained transformer language models | 2022 | link | NA |
| Falcon | The Falcon Series of Open Language Models | 2023 | link | link |
| Mistral | Mistral 7B | 2023 | link | link |
| Mixtral | Mixtral of experts | 2023 | link | link |
| LLaMA | LLaMA: Open and efficient foundation language models | 2023 | link | link |
| LLaMA 2 | Llama 2: Open foundation and fine-tuned chat models | 2023 | link | link |
| Vicuna | Judging LLM-as-a-judge with MT-bench and chatbot arena | 2023 | link | link |
| Gemma | Gemma: Open models based on Gemini research and technology | 2024 | link | link |
| Gemma 2 | Gemma 2: Improving open language models at a practical size | 2024 | link | link |
| Nemotron 4 | Nemotron-4 340B technical report | 2024 | link | link |
| Qwen | Qwen technical report | 2023 | link | link |
| Qwen 2.5 | Qwen2.5 technical report | 2024 | link | link |
| Qwen 3 | Qwen3 technical report | 2025 | link | link |
| Phi-4 | Phi-4 technical report | 2024 | link | link |
| DeepSeek | DeepSeek LLM: Scaling open-source language models with longtermism | 2024 | link | link |
| DeepSeek-V2 | DeepSeek-v2: A strong, economical, and efficient mixture-of-experts language model | 2024 | link | link |
| DeepSeek-V3 | DeepSeek-V3 technical report | 2024 | link | link |
| DeepSeek-R1 | DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning | 2025 | link | link |
| ReasonFlux | ReasonFlux: Hierarchical LLM reasoning via scaling thought templates | 2025 | link | link |

Vision Language Models (VLMs):

| Model | Paper Title | Year | Paper | Code |
|---|---|---|---|---|
| DINO | Emerging properties in self-supervised vision transformers | 2021 | link | link |
| DINOv2 | Dinov2: Learning robust visual features without supervision | 2023 | link | link |
| BEiT | Beit: Bert pre-training of image transformers | 2021 | link | link |
| CLIP | Learning transferable visual models from natural language supervision | 2021 | link | link |
| ALIGN | Scaling up visual and vision-language representation learning with noisy text supervision | 2021 | link | NA |
| FLAVA | Flava: A foundational language and vision alignment model | 2022 | link | link |
| Florence | Florence: A new foundation model for computer vision | 2021 | link | NA |
| OFA | Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | 2022 | link | link |
| Unified-IO | Unified-io: A unified model for vision, language, and multi-modal tasks | 2022 | link | link |
| AIM | Scalable Pre-training of Large Autoregressive Image Models | 2024 | link | link |
| AIMv2 | Multimodal Autoregressive Pre-training of Large Vision Encoders | 2024 | link | link |
| BLIP | Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation | 2022 | link | link |
| BLIP 2 | Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | 2023 | link | link |
| SigLIP | Sigmoid loss for language image pre-training | 2023 | link | link |
| SigLIP 2 | Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features | 2025 | link | link |
| OpenCLIP | Reproducible Scaling Laws for Contrastive Language-Image Learning | 2023 | link | link |
| SAM | Segment anything | 2023 | link | link |
| SAM 2 | Sam 2: Segment anything in images and videos | 2024 | link | link |
| DALL-E | Zero-shot text-to-image generation | 2021 | link | NA |
| DALL-E 2 | Hierarchical text-conditional image generation with clip latents | 2022 | link | NA |
| DALL-E 3 | Improving image generation with better captions | 2023 | link | NA |
| Stable Diffusion | High-resolution image synthesis with latent diffusion models | 2022 | link | link |
| Imagen 3 | Imagen 3 | 2024 | link | NA |
| Edify | Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models | 2024 | link | NA |
| LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | 2024 | link | link |
| GPT-4V | GPT-4V | 2024 | link | NA |
| MiniGPT-4 | Minigpt-4: Enhancing vision-language understanding with advanced large language models | 2023 | link | link |
| Flamingo | Flamingo: a Visual Language Model for Few-Shot Learning | 2022 | link | NA |
| LLaVa | Improved baselines with visual instruction tuning | 2024 | link | link |
| Video-LLaVa | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | 2023 | link | link |
| Pixtral | Pixtral 12B | 2024 | link | link |
| Phi-3.5-Vision | Phi-3 technical report: A highly capable language model locally on your phone | 2024 | link | link |
| VILA | Vila: On pre-training for visual language models | 2024 | link | link |
| NVILA | NVILA: Efficient frontier visual language models | 2024 | link | link |
| VILA-U | Vila-u: a unified foundation model integrating visual understanding and generation | 2024 | link | link |
| TokenFlow | Tokenflow: Unified image tokenizer for multimodal understanding and generation | 2024 | link | link |
| VAR | Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | 2024 | link | link |
| InstructBLIP | Instructblip: Towards general-purpose vision-language models with instruction tuning | 2023 | link | link |
| Yi-VL | Yi: Open foundation models by 01.AI | 2024 | link | link |
| Qwen-VL | Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond | 2023 | link | link |
| Qwen2-VL | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | 2024 | link | link |
| Qwen2.5-VL | Qwen2.5-VL Technical Report | 2025 | link | link |
| InternVL | Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks | 2024 | link | link |
| InternVL 1.5 | How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites | 2024 | link | link |
| InternVL3 | Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models | 2025 | link | link |
| InternVideo2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024 | link | link |
| LLaVa-OneVision | LLaVA-OneVision: Easy Visual Task Transfer | 2024 | link | link |
| LLaVa-NeXT | Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models | 2024 | link | link |
| CogVLM2 | Cogvlm2: Visual language models for image and video understanding | 2024 | link | link |
| Bunny | Efficient multimodal learning from data-centric perspective | 2024 | link | link |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | 2024 | link | link |
| Apollo | Apollo: An Exploration of Video Understanding in Large Multimodal Models | 2024 | link | link |
| DeepSeek-VL | DeepSeek-VL: Towards Real-World Vision-Language Understanding | 2024 | link | link |
| DeepSeek-VL2 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | 2024 | link | link |
| Emu 3 | Emu3: Next-Token Prediction is All You Need | 2024 | link | link |
| Janus | Janus: Decoupling visual encoding for unified multimodal understanding and generation | 2024 | link | link |
| JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | 2024 | link | link |
| Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | 2025 | link | link |
| Movie Gen | Movie Gen: A Cast of Media Foundation Models | 2024 | link | NA |
| Mochi | [blog] Mochi 1: A new SOTA in open text-to-video | 2024 | link | link |
| Imagen Video | Imagen video: High definition video generation with diffusion models | 2022 | link | NA |
| Make-A-Video | Make-A-Video: Text-to-Video Generation without Text-Video Data | 2023 | link | link |
| Tune-A-Video | Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation | 2023 | link | link |
| PixelDance | Make pixels dance: High-dynamic video generation | 2024 | link | link |
| CogVideoX | CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | 2024 | link | link |
| FlashVideo | FlashVideo:Flowing Fidelity to Detail for Efficient High-Resolution Video Generation | 2025 | link | link |
| Goku | Goku: Flow Based Video Generative Foundation Models | 2025 | link | link |
| Step-Video-T2V | Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | 2025 | link | link |
| Sora | [blog] Sora: Creating Video from Text | 2024 | link | NA |

Audio-Language Models (ALMs):

| Model | Paper Title | Year | Paper | Code |
|---|---|---|---|---|
| Wav2Vec 2.0 | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | 2020 | link | link |
| HuBERT | Hubert: Self-supervised speech representation learning by masked prediction of hidden units | 2021 | link | link |
| WavLM | Wavlm: Large-scale self-supervised pre-training for full stack speech processing | 2022 | link | link |
| Whisper | Robust speech recognition via large-scale weak supervision | 2023 | link | link |
| USM | Google usm: Scaling automatic speech recognition beyond 100 languages | 2023 | link | NA |
| UniAudio | Uniaudio: An audio foundation model toward universal audio generation | 2023 | link | link |
| MERT | Mert: Acoustic music understanding model with large-scale self-supervised training | 2023 | link | link |
| CLAP | Clap learning audio concepts from natural language supervision | 2023 | link | link |
| SenseVoice | Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms | 2024 | link | link |
| CosyVoice | Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens | 2024 | link | link |
| Vall-E | Neural codec language models are zero-shot text to speech synthesizers | 2023 | link | NA |
| SpeechT5 | Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing | 2021 | link | link |
| SLM | Slm: Bridge the thin gap between speech and text foundation models | 2023 | link | NA |
| AudioGPT | Audiogpt: Understanding and generating speech, music, sound, and talking head | 2024 | link | link |
| SpeechGPT | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | 2023 | link | link |
| AudioPaLM | Audiopalm: A large language model that can speak and listen | 2023 | link | NA |
| SALMONN | SALMONN: Towards Generic Hearing Abilities for Large Language Models | 2024 | link | link |
| WavLLM | Wavllm: Towards robust and adaptive speech large language model | 2024 | link | link |
| Pengi | Pengi: An audio language model for audio tasks | 2023 | link | link |
| LTU | Listen, Think, and Understand | 2024 | link | link |
| GAMA | GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | 2024 | link | link |
| Qwen2-Audio | Qwen2-audio technical report | 2024 | link | link |
| SeamlessM4T | SeamlessM4T-Massively Multilingual & Multimodal Machine Translation | 2023 | link | link |
| Step-Audio | Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction | 2025 | link | link |
| MusicLM | Musiclm: Generating music from text | 2023 | link | NA |
| AudioLDM | Audioldm: Text-to-audio generation with latent diffusion models | 2023 | link | link |

Large Multi-modal Models (LMMs):

| Model | Paper Title | Year | Paper | Code |
|---|---|---|---|---|
| AudioCLIP | Audioclip: Extending clip to image, text and audio | 2022 | link | link |
| 4M | 4m: Massively multimodal masked modeling | 2023 | link | link |
| ImageBind | Imagebind: One embedding space to bind them all | 2023 | link | link |
| PandaGPT | Pandagpt: One model to instruction-follow them all | 2023 | link | link |
| NeXT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | 2023 | link | link |
| Video-LLaMA | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023 | link | link |
| Video-LLaMA2 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024 | link | link |
| Video-SALMONN | video-salmonn: Speech-enhanced audio-visual large language models | 2024 | link | link |
| Gemini Pro | Gemini: a family of highly capable multimodal models | 2023 | link | NA |
| LLaMA 3 | The llama 3 herd of models | 2024 | link | link |
| Qwen2.5-Omni | Qwen2.5-Omni technical report | 2025 | link | link |
| GPT-4o | GPT-4o System Card | 2024 | link | NA |
In this section, we review recent papers on the applications of foundation models in biometrics:
- Foundation Models for Biometric Recognition
- Foundation Models for Soft-biometric Detection
- Foundation Models for Deepfake and Forgery Detection
- Foundation Models for Anti-spoofing
- Foundation Models for Synthetic Biometric Generation

Foundation Models for Biometric Recognition:

| Paper Title | Year | Modality / Task | Paper | Code |
|---|---|---|---|---|
| Exploring wav2vec 2.0 on speaker verification and language identification | 2020 | speaker and language identification | link | NA |
| ChatGPT and biometrics: an assessment of face recognition, gender detection, and age estimation capabilities | 2024 | face verification, gender detection, age estimation | link | NA |
| How Good is ChatGPT at Face Biometrics? A First Look into Recognition, Soft Biometrics, and Explainability | 2024 | face verification | link | NA |
| ChatGPT Meets Iris Biometrics | 2024 | iris recognition | link | NA |
| Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition | 2025 | face verification | link | NA |
| Benchmarking Foundation Models for Zero-Shot Biometric Tasks | 2025 | face verification, soft biometric attribute prediction (gender and race), iris recognition, iris presentation attack detection, face morph detection, and face deepfake detection | link | NA |
| A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding | 2021 | speaker verification | link | link |
| Iris-SAM: Iris Segmentation Using a Foundation Model | 2024 | iris segmentation | link | link |
| SAM-Iris: A SAM-Based Iris Segmentation Algorithm | 2025 | iris segmentation | link | NA |
| Froundation: Are foundation models ready for face recognition? | 2024 | face recognition | link | link |
| HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | 2025 | audio-visual human video recognition (emotion recognition, expression description, and action understanding) | link | link |
| FaceLLM: A Multimodal Large Language Model for Face Understanding | 2025 | face recognition, anti-spoofing, deepfake detection, attribute prediction, expression, parsing, pose, crowd counting | link | link |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | 2025 | face recognition, anti-spoofing, deepfake detection, attribute prediction, expression, parsing, pose, crowd counting | link | link |
| Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants | 2025 | facial attributes, age estimation, expression recognition, attack detection, recognition; human attributes, action, spatial/social relations, re-ID | link | link |
| From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing | 2024 | face recognition explainability | link | NA |
| FaceOracle: Chat with a Face Image Oracle | 2025 | face image quality assessment | link | NA |
| Unispeech-sat: Universal speech representation learning with speaker aware pre-training | 2022 | speaker ID, verification, diarization, phoneme recognition, keyword spotting, emotion recognition | link | link |
| Large-scale self-supervised speech representation learning for automatic speaker verification | 2022 | speaker verification | link | link |
| General facial representation learning in a visual-linguistic manner | 2022 | face parsing, alignment, attribute recognition | link | link |
| Marlin: Masked autoencoder for facial video representation learning | 2023 | face attribute recognition, expression recognition, deepfake detection, lip synchronization | link | link |
| Self-Supervised Facial Representation Learning with Facial Region Awareness | 2024 | face expression and attribute recognition | link | link |
| Pose-disentangled contrastive learning for self-supervised facial representation | 2023 | face expression, face recognition, head pose estimation | link | link |
| Pros: Facial omni-representation learning via prototype-based self-distillation | 2024 | face parsing, attribute recognition, emotion detection, landmark detection | link | link |
| ComFace: Facial Representation Learning with Synthetic Data for Comparing Faces | 2024 | face expression change, weight change, age change estimation | link | NA |
| SwinFace: a multi-task transformer for face recognition, expression recognition, age estimation and attribute estimation | 2023 | face attributes, age estimation, expression recognition, face recognition | link | link |
| FaceXFormer: A Unified Transformer for Facial Analysis | 2024 | face parsing, landmarks, head pose estimation, age/gender/race estimation, attribute recognition, expression recognition | link | link |
| Task-adaptive Q-Face | 2024 | head pose estimation, face attribute recognition, age estimation, expression recognition | link | NA |
| Faceptor: A generalist model for face perception | 2024 | face parsing, landmarks, age and gender estimation, attribute recognition, expression recognition, face recognition | link | link |

Foundation Models for Soft-biometric Detection:

| Paper Title | Year | Modality / Task | Paper | Code |
|---|---|---|---|---|
| Robust light-weight facial affective behavior recognition with clip | 2024 | facial expression classification; action unit detection | link | link |
| Cliper: A unified vision-language framework for in-the-wild facial expression recognition | 2024 | face static & dynamic expression recognition | link | link |
| Emoclip: A vision-language method for zero-shot video facial expression recognition | 2024 | video facial emotion recognition | link | link |
| Finecliper: Multi-modal fine-grained clip for dynamic facial expression recognition with adapters | 2024 | dynamic facial expression recognition | link | NA |
| Face-mllm: A large face perception model | 2024 | face age/gender, expression, action units, attributes | link | NA |
| FaceGPT: Self-supervised Learning to Chat about 3D Human Faces | 2024 | face 3DMM parameter generation | link | NA |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | 2025 | face attribute detection | link | link |
| Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning | 2025 | face expression recognition, action unit detection, facial attribute detection, age estimation, and deepfake detection | link | link |
| FaceInsight: A Multimodal Large Language Model for Face Perception | 2025 | face attribute recognition, age/gender/race estimation, and expression prediction | link | NA |
| R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning | 2025 | audio-visual emotion recognition with reasoning | link | link |
| ChatGPT and biometrics: an assessment of face recognition, gender detection, and age estimation capabilities | 2024 | face gender detection, age estimation | link | NA |
| How Good is ChatGPT at Face Biometrics? A First Look into Recognition, Soft Biometrics, and Explainability | 2024 | age, gender, ethnicity, hair color | link | NA |
| ChatGPT Meets Iris Biometrics | 2024 | iris–face matching; soft-biometrics | link | NA |

Foundation Models for Deepfake and Forgery Detection:

| Paper Title | Year | Modality / Task | Paper | Code |
|---|---|---|---|---|
| MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection | 2024 | face forgery detection | link | link |
| Forensics Adapter: Adapting CLIP for Generalizable Face Forgery Detection | 2024 | face forgery detection | link | link |
| MADation: Face Morphing Attack Detection with Foundation Models | 2025 | face morph attack detection | link | link |
| FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning | 2024 | deepfake detection, anti-spoofing, unseen diffusion forgery | link | link |
| Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation | 2022 | voice spoofing & deepfake detection | link | link |
| X2-dfd: A framework for explainable and extendable deepfake detection | 2024 | face deepfake detection | link | link |
| Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant | 2024 | forgery analysis assistant | link | link |
| Towards general visual-linguistic face forgery detection (v2) | 2025 | face forgery detection | link | link |
| Evaluating the Effectiveness of Attack-Agnostic Features for Morphing Attack Detection | 2024 | face morph attack detection | link | link |
| Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models | 2024 | speaker deepfake detection | link | NA |
| Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector | 2025 | face deepfake detection + description | link | link |
| Standing on the shoulders of giants: Reprogramming visual-language model for general deepfake detection | 2025 | face deepfake detection | link | link |
| Can chatgpt detect deepfakes? a study of using multimodal large language models for media forensics | 2024 | face deepfake detection | link | link |
| How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception | 2024 | audio-visual deepfake detection | link | NA |
| ChatGPT Encounters Morphing Attack Detection: Zero-Shot MAD with Multi-Modal Large Language Models and General Vision Models | 2025 | face morph detection | link | NA |

Foundation Models for Anti-spoofing:

| Paper Title | Year | Modality / Task | Paper | Code |
|---|---|---|---|---|
| Flip: Cross-domain face anti-spoofing with language guidance | 2023 | fine-tune CLIP image encoder for face (FLIP alignment) | link | link |
| On Self-Supervised Learning and Prompt Tuning of Vision Transformers for Cross-sensor Fingerprint Presentation Attack Detection | 2023 | SSL via masked-fingerprint prediction with prompt tuning | link | NA |
| CPL-CLIP: Compound Prompt Learning for Flexible-Modal Face Anti-Spoofing | 2024 | face anti-spoofing | link | NA |
| Fm-clip: Flexible modal clip for face anti-spoofing | 2024 | cross-modal anti-spoofing | link | NA |
| La-SoftMoE CLIP for Unified Physical-Digital Face Attack Detection | 2024 | unified physical-digital face attack detection | link | NA |
| Cfpl-fas: Class free prompt learning for generalizable face anti-spoofing | 2024 | face anti-spoofing | link | NA |
| InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing | 2025 | face anti-spoofing | link | link |
| Reliable and Balanced Transfer Learning for Generalized Multimodal Face Anti-Spoofing | 2025 | multimodal face anti-spoofing | link | link |
| FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models | 2025 | face anti-spoofing (classification and attack localization) | link | link |
| Interpretable face anti-spoofing: Enhancing generalization with multimodal large language models | 2025 | face anti-spoofing | link | NA |
| Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning | 2025 | face anti-spoofing (spoofing detection and reasoning) | link | NA |
| VL-FAS: Domain Generalization via Vision-Language Model For Face Anti-Spoofing | 2024 | face anti-spoofing | link | NA |
| FoundPAD: Foundation Models Reloaded for Face Presentation Attack Detection | 2025 | face anti-spoofing | link | link |
| Towards Iris Presentation Attack Detection with Foundation Models | 2025 | iris anti-spoofing | link | NA |
| Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning | 2025 | face presentation attack detection | link | link |
| Are Foundation Models All You Need for Zero-shot Face Presentation Attack Detection? | 2025 | face presentation attack detection | link | link |
| Shield: An evaluation benchmark for face spoofing and forgery detection with multimodal large language models | 2025 | face anti-spoofing (RGB, infrared, depth) and forgery detection | link | link |
| ChatGPT Meets Iris Biometrics | 2024 | iris presentation attack detection | link | NA |

Foundation Models for Synthetic Biometric Generation:

| Paper Title | Year | Modality / Task | Paper | Code |
|---|---|---|---|---|
| Toward open-world text-driven face generation and manipulation via stylegan3 | 2024 | Text-to-face synthesis | link | NA |
| AnyFace++: A unified framework for free-style text-to-face synthesis and manipulation | 2024 | Text-guided face editing | link | NA |
| AnyFace: Free-style text-to-face synthesis and manipulation | 2022 | Text-to-face generation | link | NA |
| Towards counterfactual image manipulation via clip | 2022 | Controllable text-to-face | link | link |
| Prompt-Based Modality Bridging for Unified Text-to-Face Generation and Manipulation | 2024 | Prompt-based face synthesis | link | NA |
| Tecm-clip: Text-based controllable multi-attribute face image manipulation | 2022 | face attribute / expression editing | link | link |
| Stylemc: Multi-channel based fast text-guided image generation and manipulation | 2022 | face multi-attribute editing | link | link |
| Photoverse: Tuning-free image customization with text-to-image diffusion models | 2023 | Few-shot personalised face portrait generation | link | link |
| Fastcomposer: Tuning-free multi-subject image generation with localized attention | 2024 | fast subject-driven face text-to-image | link | link |
| Moa: Mixture-of-attention for subject-context disentanglement in personalized image generation | 2024 | multi-concept face portrait generation | link | NA |
| Photomaker: Customizing realistic human photos via stacked id embedding | 2024 | high-fidelity face personalisation | link | link |
| Face0: Instantaneously conditioning a text-to-image model on a face | 2023 | Identity-preserving face text-to-image | link | NA |
| Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models | 2023 | face instant personalisation | link | link |
| Dreamidentity: Improved editability for efficient face-identity preserved image generation | 2023 | face identity-guided generation | link | NA |
| Portraitbooth: A versatile portrait model for fast identity-preserved personalization | 2024 | face few-shot portrait generation | link | NA |
| Instantid: Zero-shot identity-preserving generation in seconds | 2024 | face real-time personalisation | link | link |
| ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning | 2024 | face identity-consistent generation | link | link |
| Facestudio: Put your face everywhere in seconds | 2023 | face ID & style controllable text-to-image | link | link |
| IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models | 2024 | identity-aware face editing | link | NA |
| Arc2face: A foundation model for id-consistent human faces | 2024 | identity-conditioned face generation | link | link |
| Face Reconstruction from Face Embeddings using Adapter to a Face Foundation Model | 2024 | General identity-conditioned face generation | link | link |
| Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance | 2025 | Identity-conditioned 3D head / avatar generation | link | link |
| ClipSwap: Towards High Fidelity Face Swapping via Attributes and CLIP-Informed Loss | 2024 | Face swapping | link | NA |
We will keep updating the git repository and webpage of our survey. To add your paper, please contact the first author (hatef.otroshi@idiap.ch) or complete the submission form.
We appreciate your contributions and look forward to keeping this survey comprehensive and up to date!
If you find this survey useful, please consider citing it:
@article{fmbiometrics2025survey,
title={Foundation Models and Biometrics: A Survey and Outlook},
author={Hatef Otroshi Shahreza and S{\'e}bastien Marcel},
journal={IEEE Transactions on Information Forensics and Security},
year={2025},
publisher={IEEE}
}