Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes

CVPR2025, Poster
Ting Yu1, Yi Lin1, Jun Yu2, Zhenyu Lou3, Qiongjie Cui4
1Hangzhou Normal University   2Harbin Institute of Technology (Shenzhen)  
3Zhejiang University    4Singapore University of Technology and Design

🥰 Abstract

Recent advances in human motion prediction (HMP) have shifted focus from isolated motion data to integrating human-scene correlations. In particular, the latest methods leverage human gaze points, using their spatial coordinates to indicate intent: where a person might move within a 3D environment. Despite promising trajectory results, these methods often produce inaccurate poses by overlooking the semantic implications of gaze, specifically the affordances of observed objects, which indicate possible interactions. To address this, we propose GAP3DS, an affordance-aware HMP model that utilizes gaze-informed object affordances to improve HMP in complex 3D environments. GAP3DS incorporates a gaze-guided affordance learner to identify relevant objects in the scene and infer their affordances based on human gaze, thus contextualizing future human-object interactions. This affordance information, enriched with visual features and gaze data, conditions the generation of multiple human-object interaction poses, which are subsequently decoded into final motion predictions. Extensive experiments on two real-world datasets demonstrate that GAP3DS outperforms state-of-the-art methods in both trajectory and pose accuracy, producing more physically consistent and contextually grounded predictions.
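To make the pipeline in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the three stages it describes: gaze-guided affordance learning, affordance-conditioned generation of multiple candidate interaction poses, and decoding into future motion. All module names, feature dimensions, and the GRU encoder/decoder choice are illustrative assumptions, not the authors' implementation; refer to the paper for the actual architecture.

```python
import torch
import torch.nn as nn


class GazeAffordanceLearner(nn.Module):
    """Scores scene objects by gaze relevance and pools an affordance embedding.
    Names and dimensions are illustrative, not taken from the GAP3DS codebase."""

    def __init__(self, obj_dim=256, gaze_dim=3, aff_dim=128):
        super().__init__()
        self.gaze_proj = nn.Linear(gaze_dim, obj_dim)  # lift the 3D gaze point to feature space
        self.aff_head = nn.Sequential(
            nn.Linear(obj_dim, aff_dim), nn.ReLU(), nn.Linear(aff_dim, aff_dim)
        )

    def forward(self, obj_feats, gaze):
        # obj_feats: (B, N, obj_dim) per-object visual features; gaze: (B, 3)
        query = self.gaze_proj(gaze).unsqueeze(1)                       # (B, 1, obj_dim)
        relevance = torch.softmax((obj_feats * query).sum(-1), dim=-1)  # gaze-object weights (B, N)
        pooled = (relevance.unsqueeze(-1) * obj_feats).sum(1)           # gaze-weighted object feature
        return self.aff_head(pooled), relevance                         # affordance embedding + weights


class GAP3DSSketch(nn.Module):
    """Toy pipeline: encode past motion, condition on the gaze-informed affordance,
    generate k candidate interaction poses, decode each into a future sequence."""

    def __init__(self, pose_dim=63, fut_len=60, obj_dim=256, aff_dim=128, k=5):
        super().__init__()
        self.fut_len = fut_len
        self.affordance = GazeAffordanceLearner(obj_dim, 3, aff_dim)
        self.motion_enc = nn.GRU(pose_dim, 256, batch_first=True)
        self.cond = nn.Linear(256 + aff_dim, 256)
        self.pose_heads = nn.ModuleList(nn.Linear(256, pose_dim) for _ in range(k))
        self.decoder = nn.GRU(pose_dim, 256, batch_first=True)
        self.out = nn.Linear(256, pose_dim)

    def forward(self, past_poses, obj_feats, gaze):
        # past_poses: (B, T, pose_dim) observed motion history
        aff, _ = self.affordance(obj_feats, gaze)
        _, h = self.motion_enc(past_poses)                    # h: (1, B, 256) motion context
        ctx = torch.relu(self.cond(torch.cat([h[-1], aff], dim=-1)))
        futures = []
        for head in self.pose_heads:
            anchor = head(ctx)                                # one candidate interaction pose
            seq = anchor.unsqueeze(1).repeat(1, self.fut_len, 1)
            dec, _ = self.decoder(seq, h)                     # decode pose + history into motion
            futures.append(self.out(dec))
        return torch.stack(futures, dim=1)                    # (B, k, fut_len, pose_dim)


if __name__ == "__main__":
    B, T, N = 2, 30, 8  # batch, history length, number of scene objects
    model = GAP3DSSketch()
    preds = model(torch.randn(B, T, 63), torch.randn(B, N, 256), torch.randn(B, 3))
    print(preds.shape)  # torch.Size([2, 5, 60, 63])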

😘 Video

🏃 We propose GAP3DS, an affordance-aware human motion prediction (HMP) model that enhances the realism and accuracy of motion prediction in real-world 3D environments.

🥳 GAP3DS's Results

BibTeX

@inproceedings{yu2025visionguided,
    author = {Ting Yu and Yi Lin and Jun Yu and Zhenyu Lou and Qiongjie Cui},
    title = {Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2025},
    note = {CCF A}
}