Looking for the 2025 edition? Visit IPVG Challenge @ ACM MM 2025

ACM Multimedia 2026 Grand Challenge

Advancing the frontiers of identity-preserving generative video models

Challenge Overview

In this grand challenge, we introduce the Identity-Preserving Video Generation (IPVG) task, which maintains consistency with a given reference identity throughout the text-to-video generation process. This year's IPVG grand challenge includes two tracks: Facial Identity-Preserving Video Generation and Sequential Action Identity-Preserving Video Generation. The goal of this grand challenge is two-fold: (a) coalescing community effort around new, challenging identity-preserving video generation datasets, and (b) offering fertile ground for designing controllable generative models that enable precise identity binding in video generation, propelling the field toward more accountable and user-steerable video synthesis systems.

Facial Identity-Preserving Video Generation
Sequential Action Identity-Preserving Video Generation

To further support the research community, this year's challenge provides two large-scale datasets: the Identity-Preserving Video Benchmark (VIP-200K), with approximately 500,000 video-prompt pairs and 200,000 unique identities, and the newly introduced ReactID-Data, which features subject–video pairs accompanied by structured timeline annotations for sequential action generation.

Task Description

This year we will focus on two tasks:

01

Facial identity-preserving video generation

Given textual prompts and reference facial images of an identity, the goal is to synthesize temporally consistent videos that align with the prompts while preserving the reference identity.

02

Sequential action identity-preserving video generation

Given reference images and timeline prompts specifying multiple actions with timestamps (e.g., "0–2s: [Action A]; 2–5s: [Action B]"), the goal is to generate videos where the subject accurately performs the sequence while consistently preserving identity. This track applies to full-body humans, animals, and objects.

Contestants must develop identity-preserving video generation systems. For Track 1, systems should be based on the VIP-200K dataset. For Track 2, systems should be based on the ReactID-Data dataset. For evaluation, systems must generate at least one video per <identity images, prompt> pair in the test set.
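For Track 2, systems must interpret the structured timeline prompts described above. A minimal parsing sketch is shown below; the exact prompt grammar is an assumption based on the example format "0–2s: [Action A]; 2–5s: [Action B]", and participants should confirm it against the released ReactID-Data files.

```python
import re

def parse_timeline_prompt(prompt: str):
    """Split a structured timeline prompt such as
    "0-2s: [Action A]; 2-5s: [Action B]" into (start, end, action) tuples.
    The grammar (semicolon-separated "START-ENDs: action" segments) is
    assumed from the example in the task description."""
    segments = []
    for chunk in prompt.split(";"):
        # Accept both ASCII hyphens and en dashes in the time range.
        m = re.match(r"\s*(\d+(?:\.\d+)?)\s*[-\u2013]\s*(\d+(?:\.\d+)?)s?\s*:\s*(.+)", chunk)
        if m:
            start, end = float(m.group(1)), float(m.group(2))
            action = m.group(3).strip().strip("[]")
            segments.append((start, end, action))
    return segments
```

The returned list of timed segments can then drive conditioning or evaluation code, e.g. checking that each action interval is covered by the generated video.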

Datasets

To formalize the task of identity-preserving text-to-video generation, we provide the following datasets:

Track 1 — Facial Identity-Preserving Video Generation

Training Dataset

500,000 videos in VIP-200K, each coupled with a textual prompt and one or more identity images.

Testing Dataset

200 unseen person IDs. Each ID has portrait images and five textual prompts for video generation, totaling 1,000 test pairs.

Dataset     Context                           Source                        #Videos    #IDs      #Hours   #Prompts
VIP-200K    Video-prompt-identity triplets    Automatic crawling from web   500,000    200,000   1,700    500,000

To access the dataset, please register via this form and download it from the HuggingFace Dataset page.

Track 2 — Sequential Action Identity-Preserving Video Generation

Training Dataset

ReactID-Data. Each sample is a video–prompt–identity set, where the identity may include one or more entities (humans, animals, or objects). A 1.2M-sample subset additionally includes structured timeline prompts (e.g., "0–2s: [Action A]; 2–5s: [Action B]").

Testing Dataset

200 groups of unseen IDs. Each group contains reference images for one or more subjects, and is provided with 5 different structured timeline prompts for video generation, for a total of 1,000 test samples.

Coming Soon: The Track 2 dataset (ReactID-Data) is currently under preparation and will be released shortly. Please stay tuned for updates.

Test Submission

We provide two separate test datasets for the challenge tracks. Download the appropriate dataset, prepare your results following our structure, and submit via Google Drive.

Track 1: Facial Identity-Preserving Track

Download the facial track test dataset containing 200 identities with reference images and prompts.

Download Test Set

Track 2: Sequential Action Identity-Preserving Track

Download the sequential action track test dataset containing 200 groups of IDs with reference images and structured timeline prompts.

Coming Soon

Test Dataset Content

Each dataset contains 200 identities (IDs) organized in folders. Each ID folder holds one reference image (image.png or image.webp) and five text prompt files (prompt1.txt to prompt5.txt).

Directory Structure

testset/
├── id001/
│   ├── image.png          # Reference image
│   ├── prompt1.txt        # Text prompt
│   ├── ...
│   └── prompt5.txt
├── ...
└── id200/
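A small helper can iterate this layout and pair each ID with its reference image and prompts. This is only a sketch of the documented structure (image.png/image.webp plus prompt1.txt–prompt5.txt per ID folder), not official challenge code.

```python
from pathlib import Path

def load_test_set(root: str):
    """Walk the test set layout shown above and yield
    (id_name, image_path, [prompt texts]) per ID folder.
    File names follow the documented convention."""
    for id_dir in sorted(Path(root).iterdir()):
        if not id_dir.is_dir():
            continue
        # The reference image may be either PNG or WebP.
        image = next((p for p in (id_dir / "image.png", id_dir / "image.webp")
                      if p.exists()), None)
        prompts = [(id_dir / f"prompt{i}.txt").read_text().strip()
                   for i in range(1, 6)]
        yield id_dir.name, image, prompts
```

Each yielded tuple corresponds to one ⟨identity images, prompt⟩ group for which videos must be generated.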
01

Prepare Results

Generate videos and arrange them according to the following structure.

submission/
├── id001/
│   ├── prompt1.mp4
│   ├── prompt2.mp4
│   ├── prompt3.mp4
│   ├── prompt4.mp4
│   └── prompt5.mp4
├── ...
└── id200/
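Before packaging, it is worth verifying that every expected video is present. A minimal completeness check against the layout above (idNNN/promptK.mp4, with names taken from the documented structure):

```python
from pathlib import Path

def check_submission(root: str, num_ids: int = 200):
    """Return a list of missing video paths, relative to the submission
    root, against the expected layout id001..idNNN / prompt1..5.mp4."""
    missing = []
    for i in range(1, num_ids + 1):
        id_dir = Path(root) / f"id{i:03d}"
        for k in range(1, 6):
            if not (id_dir / f"prompt{k}.mp4").exists():
                missing.append(f"id{i:03d}/prompt{k}.mp4")
    return missing
```

An empty return value means the folder matches the required structure; anything listed should be generated or renamed before submission.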
02

Package & Submit

Compress your results and submit via Google Drive with proper sharing permissions.

Step 1: Compress results into facial_results.zip (Track 1) or sequential_action_results.zip (Track 2)
Step 2: Upload to Google Drive with sharing: "Anyone with the link can view"
Step 3: Send shareable URL(s) to panyw.ustc@gmail.com
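Step 1 can be automated so that the ID folders sit at the archive root, as the expected layout suggests. A sketch using Python's standard zipfile module (the archive name and layout follow the steps above; whether stored or deflated compression is preferred is our assumption, chosen because MP4 files are already compressed):

```python
import zipfile
from pathlib import Path

def package_results(submission_dir: str, archive_name: str):
    """Zip the submission folder so that ID folders sit at the archive
    root, e.g. id001/prompt1.mp4.  MP4 is already compressed, so the
    files are stored without recompression."""
    root = Path(submission_dir)
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_STORED) as zf:
        for path in sorted(root.rglob("*.mp4")):
            zf.write(path, path.relative_to(root))
```

Usage: `package_results("submission", "facial_results.zip")` for Track 1, or with `"sequential_action_results.zip"` for Track 2.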

Videos must be encoded in H.264 and saved in MP4 format. Non-compliant submissions may be disqualified.

Important: Only the most recent submission per track will be evaluated. Results will be downloaded after the submission deadline. Strictly follow the specified folder structure and naming conventions.

Evaluation Metric

Videos will be assessed based on:

01

Identity Preservation

Feature similarity between generated frames and the reference identity image, combined with manual annotation scores.

02

Video Quality

Evaluated via visual quality, motion dynamics, and text alignment using both objective metrics and human assessment.

03

Action–Timeline Alignment (Track 2 only)

Measures how well the generated actions match the timestamps in the timeline prompts (text–video alignment and temporal consistency).

Final scores combine objective and subjective evaluation results.
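To make the objective criteria concrete, the sketch below shows two common proxies: cosine similarity between face embeddings (for identity preservation) and temporal intersection-over-union between a generated action interval and the prompted one (for timeline alignment). These are illustrative formulas only; the official evaluation may use specific feature extractors and weighting that the organizers have not detailed here.

```python
import math

def cosine_similarity(a, b):
    """Identity-preservation proxy: cosine similarity between the face
    embedding of a generated frame and that of the reference image.
    Which face recognizer produces the embeddings is left unspecified."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def temporal_iou(pred, gt):
    """Track 2 alignment proxy: intersection-over-union between a
    predicted action interval and the prompted one, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0
```

Higher values are better for both proxies: identical embeddings give a cosine similarity of 1.0, and a perfectly timed action gives a temporal IoU of 1.0.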

Participation

The challenge is team-based. Participants can enter one or both tracks. Teams can have multiple members, but individuals cannot be in multiple teams.

Top three teams per track will receive awards. Accepted submissions qualify for the conference's grand challenge award.

Timeline

March 10, 2026

Website & Call for Participation Ready

March 15, 2026

Dataset available for download (training and validation sets)

May 10, 2026

Testing set of each track available for download

May 16, 2026

Results submission

May 17–21, 2026

Objective evaluation

May 22, 2026

Evaluation results announcement

May 28, 2026

Paper submission deadline

Paper Submission

We will invite the top three performing teams to submit a technical paper (up to 6+2 pages) via email (panyw.ustc@gmail.com); submissions will be peer-reviewed. The paper submission deadline is May 28, 2026. Accepted papers will be published in the ACM Multimedia 2026 conference proceedings.

Paper format: Submitted papers (.pdf) must use the ACM Article Template (https://www.acm.org/publications/proceedings-template) in the traditional double-column format. Remember to include CCS Concepts and Keywords, and list all author information in your submission.