SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval

Peking University, Wangxuan Institute of Computer Technology
NeurIPS 2025

Abstract

In this paper, we study explicit fine-grained concept decomposition for alignment learning and present a novel framework, Structural Generative Augmentation for 3D Human Motion Retrieval (SGAR), to enable generation-augmented retrieval. Extensive experiments on three benchmarks, covering motion-text retrieval as well as recognition and generation applications, demonstrate its superior performance and promising transferability.

Method

SGAR Framework

To enable part-based motion alignment, we integrate structural linguistic knowledge from LLMs and propose a part-mixture learning strategy. By decoupling the motion representations of the global body and local parts, our method facilitates alignment at different structural levels. In addition to independently minimizing the Euclidean distance between embeddings at the global and local motion levels, we introduce a directional alignment objective to model the relational knowledge between the two levels. This further alleviates over-fitting and leads to better representation consistency.
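
The exact loss formulation is not given here, so the following is only a rough NumPy sketch of how the objective described above might be put together. It assumes that motion embeddings are aligned with paired text embeddings, that the distance term is a mean squared Euclidean distance, and that the directional term compares the global-to-part offset vectors of the two modalities by cosine similarity. All function names, tensor shapes, and the `weight` parameter are hypothetical and chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def distance_loss(a, b):
    """Mean squared Euclidean distance between paired embeddings."""
    return float(np.mean(np.sum((a - b) ** 2, axis=-1)))


def directional_loss(motion_global, motion_parts, text_global, text_parts):
    """Mean (1 - cosine similarity) between global-to-part offsets.

    Assumption: the relation between levels is modelled as the offset
    vector from the global embedding to each part embedding.
    """
    d_motion = l2_normalize(motion_parts - motion_global)
    d_text = l2_normalize(text_parts - text_global)
    return float(np.mean(1.0 - np.sum(d_motion * d_text, axis=-1)))


def part_mixture_loss(motion_global, motion_parts,
                      text_global, text_parts, weight=1.0):
    """Combine global, local, and directional terms (hypothetical weighting).

    Shapes: *_global is (batch, dim); *_parts is (batch, num_parts, dim).
    """
    loss = distance_loss(motion_global, text_global)      # global level
    loss += distance_loss(motion_parts, text_parts)       # local part level
    loss += weight * directional_loss(
        motion_global[:, None, :], motion_parts,          # broadcast over parts
        text_global[:, None, :], text_parts,
    )
    return loss


# Toy usage with random embeddings: 4 samples, 5 body parts, 16 dimensions.
rng = np.random.default_rng(0)
m_g, t_g = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
m_p, t_p = rng.normal(size=(4, 5, 16)), rng.normal(size=(4, 5, 16))
print(part_mixture_loss(m_g, m_p, t_g, t_p, weight=0.5))
```

In this sketch the directional term depends only on the orientation of the offsets, not their length, so it constrains how parts relate to the whole body independently of the two distance terms. Whether SGAR uses this particular form is an assumption.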

BibTeX

@inproceedings{zhangsgar2024,
  title={SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval},
  author={Zhang, Jiahang and Lin, Lilang and Yang, Shuai and Liu, Jiaying},
  booktitle={NeurIPS},
  year={2025}
}