RASA: Disentangled Spatial-Motional Priors for Cross-Identity Character Animation

Abstract

Cross-identity character animation aims to drive a target identity (from a reference image) to follow the motion of a source character (from a driving video). The core challenge lies in the entanglement of two essential capabilities: cross-identity spatial mapping, which aligns position, scale, and skeletal proportions between the reference and the driving pose, and subsequent motion control, which refines joint articulation, volumetric consistency, and view coherence during generation. In this paper, we introduce Reference-Aware Structural Alignment (RASA), a framework that disentangles spatial mapping from motion control by injecting structured priors into a Diffusion Transformer (DiT). Our approach operates in two complementary stages. First, a Spatial Prior Calibrator (SPC) fuses the reference identity with the driving pose to generate a spatially grounded initial noise latent, so that the target character is correctly positioned, scaled, and proportionally aligned with the driving skeleton. This resolves cross-identity spatial mismatches at the very start of generation. Second, to achieve identity-agnostic motion control, we propose an Inherent Motional Guider (IMG). Moving beyond appearance-biased 2D keypoints, IMG encodes shape-agnostic SMPL articulation parameters into a semantic motion vector. Injected into the intermediate layers of the DiT, this vector complements the base pose condition, providing anatomically consistent joint articulation and view-aware volumetric refinement. To evaluate this task, we build CIM-Bench, a high-quality, manually curated benchmark. Extensive experiments demonstrate that RASA significantly outperforms state-of-the-art methods in both motion fidelity and visual quality. Our work establishes a new paradigm for cross-identity animation, showing that disentangled spatial and motional priors are key to achieving robust and consistent character animation.

Overview

Overview of RASA. Given a reference image with its extracted pose, a driving video with its corresponding pose sequence, and SMPL-based motion parameters, we first simulate structural misalignment by applying pose transformations to the driving poses. The motion parameters of the first frame are replicated three times to align temporally with the video latents. Through the Spatial Prior Calibrator and Inherent Motional Guider modules, motion information is then accurately transferred to the reference image while preserving structural consistency.
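The two preprocessing steps in the overview can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the choice of a global scale and shift as the pose transformation, the 18-joint 2D skeleton, and the 72-dimensional per-frame motion vector are all assumptions made for illustration. The sketch also assumes one plausible reason for the three-fold replication: a video encoder that maps 1 + 4k frames to 1 + k latents, so that prepending three copies of the first frame gives 4(1 + k) frames, which group evenly into 1 + k chunks of four.

```python
import numpy as np

def simulate_misalignment(poses, scale, shift):
    """Hypothetical pose transformation: scale 2D keypoints about the
    first-frame centroid, then translate them.

    poses: array of shape (T, J, 2), T frames of J 2D joints.
    """
    center = poses[0].mean(axis=0)  # centroid of the first frame, shape (2,)
    return (poses - center) * scale + center + np.asarray(shift)

def pad_first_frame(motion, repeats=3):
    """Prepend `repeats` copies of the first frame's motion parameters.

    motion: array of shape (T, D).
    """
    first = np.repeat(motion[:1], repeats, axis=0)
    return np.concatenate([first, motion], axis=0)

def group_by_stride(motion, stride=4):
    """Group consecutive frames so each group lines up with one latent frame
    (assumes a temporal compression factor equal to `stride`)."""
    num_frames, dim = motion.shape
    if num_frames % stride != 0:
        raise ValueError("frame count must be divisible by the stride")
    return motion.reshape(num_frames // stride, stride, dim)

# Illustrative shapes only: 17 frames, 18 joints, 72 motion parameters.
rng = np.random.default_rng(0)
driving_poses = rng.random((17, 18, 2))
misaligned = simulate_misalignment(driving_poses, scale=1.3, shift=(0.1, -0.05))

motion_params = rng.random((17, 72))
grouped = group_by_stride(pad_first_frame(motion_params))
print(misaligned.shape, grouped.shape)
```

Under these assumptions, a 17-frame clip yields five groups of four frames, matching the five latent frames such an encoder would produce, and the first group contains only copies of the first frame.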

Animating Diverse References with the Same Driving Video

Animating Multiple Characters

More Results

Citation

If you find this work useful, please cite our paper:

@inproceedings{xiao2026rasa,
  title={RASA: Disentangled Spatial-Motional Priors for Cross-Identity Character Animation},
  author={Xiao, Zhen and Shen, Zhen and Qiu, Zhaofan and Yao, Ting and Liu, Xueliang and Mei, Tao},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}