Leaderboard
Phase 1
The first phase scored all participating teams on their rankings across several objective metrics. The top three teams by aggregate score advanced to the second phase.
| Team Name | Face-Cur (↑) | Face-Arc (↑) | FID (↓) | ClipScore (↑) |
|---|---|---|---|---|
| ghl (PKUVideo) | 0.492 | 0.473 | 170 | 27.8 |
| XuanYuan | 0.467 | 0.441 | 214 | 28.0 |
| Wislab | 0.285 | 0.269 | 208 | 28.6 |
Phase 2
In the second phase, the results from the top three teams underwent a user study. We constructed 3,000 one-vs-one comparison pairs by evaluating all pairwise combinations of team results for each test sample. In each head-to-head comparison, a win received 1 point, a draw 0.5 points, and a loss 0 points.
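The win/draw/loss tally described above can be sketched as follows. The `comparisons` list and its `(team_a, team_b, outcome)` format are illustrative assumptions, not the challenge's actual annotation format:

```python
from collections import defaultdict

def tally_scores(comparisons):
    """Aggregate 1v1 user-study outcomes into per-team scores.

    Each comparison awards 1 point for a win, 0.5 for a draw,
    and 0 for a loss, per the Phase 2 rules above.
    """
    scores = defaultdict(float)
    for team_a, team_b, outcome in comparisons:
        # Touch both teams so every participant appears, even with 0 points.
        scores[team_a] += 0.0
        scores[team_b] += 0.0
        if outcome == "a":
            scores[team_a] += 1.0
        elif outcome == "b":
            scores[team_b] += 1.0
        elif outcome == "draw":
            scores[team_a] += 0.5
            scores[team_b] += 0.5
        else:
            raise ValueError(f"unknown outcome: {outcome}")
    return dict(scores)

# Toy example with three annotated comparisons (not real challenge data).
example = [("ghl", "XuanYuan", "a"),
           ("ghl", "Wislab", "draw"),
           ("XuanYuan", "Wislab", "b")]
print(tally_scores(example))  # → {'ghl': 1.5, 'XuanYuan': 0.0, 'Wislab': 1.5}
```

Summing over all 3,000 pairs in this way yields the Phase 2 scores in the table below.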
| Team Name | Score | Rank |
|---|---|---|
| ghl (PKUVideo) | 1258 | 1 |
| XuanYuan | 1147.5 | 2 |
| Wislab | 594.5 | 3 |
Challenge Overview
In this grand challenge, we introduce the Identity-Preserving Video Generation (IPVG) task, which maintains the consistency of a given reference identity throughout the text-to-video generation process. This year's IPVG grand challenge includes two tracks, described in the Task Description below.
To further motivate and challenge the academic and industrial research communities, we have released a new dataset, the Identity-Preserving Video Benchmark (VIP-200K), consisting of approximately 500,000 video-prompt pairs covering 200,000 unique identities.
Task Description
This year we will focus on two tasks:
Facial identity-preserving text-to-video generation
Given textual prompts and reference facial images of an identity, the goal is to synthesize temporally consistent videos that align with the prompts while preserving the reference identity.
Full-body identity-preserving text-to-video generation
This track extends the first by enforcing identity-preserving constraints on the face, hairstyle, clothing, and other full-body attributes across generated frames.
Contestants must develop an identity-preserving video generation system based on the VIP-200K dataset. For evaluation, systems must generate at least one video per test prompt.
Datasets
To formalize the task of identity-preserving text-to-video generation, we provide the following datasets:
Training Dataset
500,000 videos in VIP-200K, each coupled with a textual prompt and one or more identity images.
Testing Dataset
200 unseen person IDs. Each ID has portrait images and five textual prompts for video generation, totaling 1,000 test pairs.
| Dataset | Context | Source | #Videos | #IDs | #Hours | #Prompts |
|---|---|---|---|---|---|---|
| VIP-200K | Video-prompt-identity triplets | Automatically crawled from the web | 500,000 | 200,000 | 1,700 | 500,000 |
To access the dataset, please register via this form and download from
Test Submission
We provide two separate test datasets for the VIP200K challenge tracks. Download the appropriate dataset, prepare your results following our structure, and submit via Google Drive.
Facial Identity-Preserving Track
Download the facial track test dataset containing 200 identities with reference images and prompts.
Full-body Identity-Preserving Track
Download the full-body track test dataset containing 200 identities with reference images and prompts.
Test Dataset Content
Each test set contains 200 identities (IDs), one folder per ID. Each ID folder contains one reference image (image.png or image.webp) and five text prompt files (prompt1.txt through prompt5.txt).
Directory Structure
```
testset/
├── id001/
│   ├── image.png      # Reference image
│   ├── prompt1.txt    # Text prompt
│   ├── ...
│   └── prompt5.txt
├── ...
└── id200/
```
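A minimal sketch for iterating over the test set in the layout shown above; the function name and return shape are illustrative, not part of any official toolkit:

```python
from pathlib import Path

def load_testset(root):
    """Yield (id_name, image_path, prompts) for each ID folder.

    Assumes the layout shown above: one reference image (image.png or
    image.webp) plus prompt1.txt .. prompt5.txt per ID folder.
    """
    for id_dir in sorted(Path(root).iterdir()):
        if not id_dir.is_dir():
            continue
        # The reference image may be PNG or WebP.
        image = next(p for p in (id_dir / "image.png", id_dir / "image.webp")
                     if p.exists())
        prompts = [(id_dir / f"prompt{i}.txt").read_text().strip()
                   for i in range(1, 6)]
        yield id_dir.name, image, prompts
```

Each yielded triple corresponds to one ID; its five prompts map to the five videos expected in the submission.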
Prepare Results
Generate one video per prompt and organize your results in the following structure.
```
submission/
├── id001/
│   ├── prompt1.mp4
│   ├── prompt2.mp4
│   ├── prompt3.mp4
│   ├── prompt4.mp4
│   └── prompt5.mp4
├── ...
└── id200/
```
Package & Submit
Compress your results and submit via Google Drive with proper sharing permissions.
Name the archive facial_results.zip or fullbody_results.zip according to your track.
Videos must be encoded in H.264 and saved in MP4 format. Non-compliant submissions may be disqualified.
Important: Only the most recent submission per track will be evaluated. Results will be downloaded after the submission deadline. Strictly follow the specified folder structure and naming conventions.
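Since non-compliant submissions may be disqualified, it is worth checking the folder layout before uploading. The sketch below is an unofficial structural check under the naming conventions above; codec compliance (H.264) must be verified separately, e.g. with ffprobe:

```python
from pathlib import Path

EXPECTED_IDS = [f"id{i:03d}" for i in range(1, 201)]      # id001 .. id200
EXPECTED_FILES = [f"prompt{i}.mp4" for i in range(1, 6)]  # prompt1 .. prompt5

def check_submission(root):
    """Return a list of structural problems found in a submission folder.

    An empty list means the folder matches the required layout:
    id001..id200, each containing prompt1.mp4..prompt5.mp4.
    """
    root = Path(root)
    problems = []
    for id_name in EXPECTED_IDS:
        id_dir = root / id_name
        if not id_dir.is_dir():
            problems.append(f"missing folder: {id_name}")
            continue
        for fname in EXPECTED_FILES:
            if not (id_dir / fname).is_file():
                problems.append(f"missing file: {id_name}/{fname}")
    return problems
```

Run it on the unzipped submission folder and fix any reported paths before compressing and sharing.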
Evaluation Metric
Videos will be assessed based on:
Identity Preservation
Measured by feature similarity between the generated video and the reference identity image, combined with manual annotation scores.
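Feature similarity of this kind is commonly computed as the mean cosine similarity between per-frame face embeddings and the reference embedding. The sketch below assumes embeddings have already been extracted (e.g. by an ArcFace-style face recognition model) and is illustrative, not the challenge's official scoring code:

```python
import numpy as np

def identity_score(frame_embeddings, ref_embedding):
    """Mean cosine similarity between per-frame face embeddings
    and the reference identity embedding.

    frame_embeddings: (num_frames, dim) array-like
    ref_embedding:    (dim,) array-like
    """
    frames = np.asarray(frame_embeddings, dtype=np.float64)
    ref = np.asarray(ref_embedding, dtype=np.float64)
    # L2-normalize so the dot product equals cosine similarity.
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref)
    return float((frames @ ref).mean())
```

For example, a frame whose embedding matches the reference exactly contributes 1.0, while an orthogonal embedding contributes 0.0 to the mean.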
Video Quality
Evaluated via visual quality, motion dynamics, and text alignment using both objective metrics and human assessment.
Final scores combine objective and subjective evaluation results.
Participation
The challenge is team-based. Participants can enter one or both tracks. Teams can have multiple members, but individuals cannot be in multiple teams.
Top three teams per track will receive awards. Accepted submissions qualify for the conference's grand challenge award.
Timeline
Website & Call for Participation
Dataset release
Testing set release
Results submission
Evaluation results announcement
Paper submission deadline
Paper Submission
We will invite the top-3 performing teams to submit a technical paper (up to 6+2 pages) via email (panyw.ustc@gmail.com), which will be peer-reviewed. The paper submission deadline is July 5, 2025 (extended from June 30, 2025). Accepted papers will be published in the conference proceedings of ACM Multimedia 2025.
Paper format: Submitted papers (PDF) must use the ACM Article Template (https://www.acm.org/publications/proceedings-template). Please remember to add CCS Concepts and Keywords, use the template in traditional double-column format, and list all author information in your submission.