ACM Multimedia 2025 Grand Challenge

Advancing the frontiers of identity-preserving generative video models

Leaderboard

Phase 1

In the first phase, all participating teams were scored based on their rankings across several objective metrics. The top three teams by aggregate score advanced to the second phase.

Team Name        Face-Cur (↑)   Face-Arc (↑)   FID (↓)   ClipScore (↑)
ghl (PKUVideo)   0.492          0.473          170       27.8
XuanYuan         0.467          0.441          214       28.0
Wislab           0.285          0.269          208       28.6

Phase 2

For the second phase, the results from the top three teams underwent a user study. We constructed 3,000 1v1 comparison pairs by evaluating all pairwise combinations of team results for each test sample (three team pairings for each of the 1,000 test samples). In each head-to-head comparison, a win earned 1 point, a draw 0.5 points, and a loss 0 points.

Team Name        Score    Rank
ghl (PKUVideo)   1258     1
XuanYuan         1147.5   2
Wislab           594.5    3
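The scoring rule above can be sketched in a few lines of Python. This is an illustrative aggregation, not the organizers' actual tooling; the function name and the input format are our own.

```python
def tally_scores(outcomes):
    """Aggregate 1v1 user-study results into per-team scores.

    `outcomes` maps an ordered team pair (team_a, team_b) to a list of
    per-sample results from team_a's perspective: "win", "draw", or "loss".
    Each comparison distributes exactly one point between the two teams
    (win = 1, draw = 0.5, loss = 0), so totals across teams equal the
    number of comparisons.
    """
    points = {"win": 1.0, "draw": 0.5, "loss": 0.0}
    scores = {}
    for (team_a, team_b), results in outcomes.items():
        for result in results:
            scores[team_a] = scores.get(team_a, 0.0) + points[result]
            scores[team_b] = scores.get(team_b, 0.0) + (1.0 - points[result])
    return scores
```

With three teams each compared on 1,000 samples against both rivals, every team takes part in 2,000 comparisons, which is why the three final scores sum to 3,000.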

Challenge Overview

In this grand challenge, we introduce the Identity-Preserving Video Generation (IPVG) task, which maintains the consistency of a given reference identity throughout the text-to-video generation process. This year's IPVG grand challenge includes two tracks:

Facial Identity-Preserving Video Generation
Full-body Identity-Preserving Video Generation

To further motivate and challenge the academic and industrial research communities, we have released a new dataset: Identity-Preserving Video Benchmark (VIP-200K), consisting of approximately 500,000 video-prompt pairs with 200,000 unique identities.

Task Description

This year we will focus on two tasks:

01

Facial identity-preserving text-to-video generation

Given reference identity facial images and corresponding textual prompts, the goal is to synthesize temporally consistent videos that align with the prompts while preserving the reference identity.

02

Full-body identity-preserving text-to-video generation

This track extends the first by enforcing identity-preserving constraints on hairstyle, face, clothing, and other attributes across generated frames.

Contestants must develop an identity-preserving video generation system based on the VIP-200K dataset. For evaluation, systems must generate at least one video for each prompt-identity pair in the test set.

Datasets

To formalize the task of identity-preserving text-to-video generation, we provide the following datasets:

Training Dataset

500,000 videos in VIP-200K, each coupled with a textual prompt and one or more identity images.

Testing Dataset

200 unseen person IDs. Each ID has portrait images and five textual prompts for video generation, totaling 1,000 test pairs.

Dataset    Context                          Source                        #Video    #IDs      #Hours   #Prompt
VIP-200K   Video-prompt-identity triplets   Automatic crawling from web   500,000   200,000   1,700    500,000

To access the dataset, please register via this form and download it from the HuggingFace Dataset page.

Test Submission

We provide two separate test datasets for the VIP200K challenge tracks. Download the appropriate dataset, prepare your results following our structure, and submit via Google Drive.

Facial Identity-Preserving Track

Download the facial track test dataset containing 200 identities with reference images and prompts.

Download Test Set

Full-body Identity-Preserving Track

Download the full-body track test dataset containing 200 identities with reference images and prompts.

Download Test Set

Test Dataset Content

Each dataset contains 200 identities (IDs) organized in folders. Each ID folder contains one reference image (image.png or image.webp) and five text prompt files (prompt1.txt through prompt5.txt).

Directory Structure

testset/
├── id001/
│   ├── image.png          # Reference image
│   ├── prompt1.txt        # Text prompt
│   ├── ...
│   └── prompt5.txt
├── ...
└── id200/

01

Prepare Results

Generate videos following our structure.

submission/
├── id001/
│   ├── prompt1.mp4
│   ├── prompt2.mp4
│   ├── prompt3.mp4
│   ├── prompt4.mp4
│   └── prompt5.mp4
├── ...
└── id200/
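Before packaging, it is worth checking the folder layout programmatically against the structure above. A minimal sketch using only the standard library; the function name and the returned error strings are illustrative, not part of the challenge tooling.

```python
import os


def validate_submission(root, num_ids=200, prompts_per_id=5):
    """Check a submission folder against the required layout:
    id001/ ... id200/, each containing prompt1.mp4 ... prompt5.mp4.

    Returns a list of problems; an empty list means the layout is valid.
    """
    problems = []
    for i in range(1, num_ids + 1):
        id_dir = os.path.join(root, f"id{i:03d}")
        if not os.path.isdir(id_dir):
            problems.append(f"missing folder: id{i:03d}")
            continue  # no point checking videos of a missing folder
        for p in range(1, prompts_per_id + 1):
            video = os.path.join(id_dir, f"prompt{p}.mp4")
            if not os.path.isfile(video):
                problems.append(f"missing video: id{i:03d}/prompt{p}.mp4")
    return problems
```

Running this against your submission folder before zipping catches missing or misnamed files, which the submission rules say may lead to disqualification.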

02

Package & Submit

Compress your results and submit via Google Drive with proper sharing permissions.

Step 1: Compress results into facial_results.zip or fullbody_results.zip
Step 2: Upload to Google Drive with sharing: "Anyone with the link can view"
Step 3: Send shareable URL(s) to panyw.ustc@gmail.com

Videos must be encoded in H.264 and saved in MP4 format. Non-compliant submissions may be disqualified.
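Step 1 can be scripted so the archive reproduces the required folder structure exactly. A minimal sketch with Python's standard library; the function name is ours, and the assumption that archive paths start at the ID folders (id001/..., not submission/id001/...) should be checked against the organizers' expectations.

```python
import os
import zipfile


def package_results(submission_dir, archive_name):
    """Zip a submission folder for upload.

    Paths inside the archive are made relative to `submission_dir`,
    so entries look like "id001/prompt1.mp4" (an assumption about the
    expected layout, not an official specification).
    """
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(submission_dir):
            for name in sorted(filenames):
                full_path = os.path.join(dirpath, name)
                arcname = os.path.relpath(full_path, submission_dir)
                zf.write(full_path, arcname)
```

For example, `package_results("submission", "facial_results.zip")` produces the zip file to upload to Google Drive in Step 2.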

Important: Only the most recent submission per track will be evaluated. Results will be downloaded after the submission deadline. Strictly follow the specified folder structure and naming conventions.

Evaluation Metric

Videos will be assessed based on:

01

Identity Preservation

Measured by feature similarity between the generated video and the reference identity image, combined with manual annotation scores.

02

Video Quality

Evaluated via visual quality, motion dynamics, and text alignment using both objective metrics and human assessment.

Final scores combine objective and subjective evaluation results.
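Identity-preservation metrics of this kind typically compute cosine similarity between a reference face embedding and per-frame face embeddings of the generated video (the Face-Arc and Face-Cur columns in the Phase 1 leaderboard suggest ArcFace- and CurricularFace-style embeddings). A minimal sketch of the core computation; the embedding extractor is assumed and not shown, and this is not the official evaluation code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identity_preservation(ref_embedding, frame_embeddings):
    """Mean cosine similarity between the reference face embedding and
    the face embeddings extracted from each generated frame.

    Higher is better; both Face-Cur and Face-Arc are reported as
    higher-is-better (↑) metrics on the leaderboard.
    """
    sims = [cosine_similarity(ref_embedding, f) for f in frame_embeddings]
    return sum(sims) / len(sims)
```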

Participation

The challenge is team-based. Participants can enter one or both tracks. Teams can have multiple members, but individuals cannot be in multiple teams.

Top three teams per track will receive awards. Accepted submissions qualify for the conference's grand challenge award.

Timeline

March 8, 2025

Website & Call for Participation

March 15, 2025

Dataset release

June 5, 2025

Testing set release

June 20, 2025

Results submission

June 28, 2025 (extended from June 26, 2025)

Evaluation results announcement

July 5, 2025 (extended from June 30, 2025)

Paper submission deadline

Paper Submission

We will invite the top-3 performing teams to submit a technical paper (up to 6+2 pages) via email (panyw.ustc@gmail.com); submissions will be peer-reviewed. The paper submission deadline is July 5, 2025 (extended from June 30, 2025). Accepted papers will be published in the ACM Multimedia 2025 conference proceedings.

Paper format: Submitted papers (PDF) must use the ACM Article Template (https://www.acm.org/publications/proceedings-template) in the traditional double-column format. Remember to include CCS Concepts and Keywords, and list all author information in your submission.