Large-scale capture of human motion in diverse, complex scenes, while immensely useful, is often considered prohibitively costly. Meanwhile, human motion alone contains rich information about the scene a person resides in and interacts with. For example, a sitting human suggests the existence of a chair, and the position of their legs further implies the chair's pose. In this paper, we propose to synthesize diverse, semantically reasonable, and physically plausible scenes based on human motion. Our framework, Scene Synthesis from HUMan MotiON (SUMMON), consists of two steps. It first uses ContactFormer, our newly introduced contact predictor, to obtain temporally consistent contact labels from human motion. Based on these predictions, SUMMON then selects the objects the human interacts with and optimizes physical plausibility losses; it further populates the scene with objects that do not interact with the human. Experimental results demonstrate that SUMMON synthesizes feasible, plausible, and diverse scenes, and that it has the potential to generate extensive human-scene interaction data for the community.
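To make the two-step pipeline concrete, the sketch below walks through both stages in plain Python. It is a minimal illustration under loud assumptions, not the released SUMMON code: ContactFormer is stubbed by a floor-proximity heuristic, the plausibility optimization is reduced to a grid search over a single contact-distance loss, and all function names and data shapes are hypothetical.

```python
# Minimal, hypothetical sketch of the two-step SUMMON pipeline.
# NOT the released implementation: ContactFormer is replaced by a
# floor-proximity stub, and object fitting by a toy grid search.
import numpy as np

def predict_contacts(body_vertices, floor_height=0.0, eps=0.05):
    """Stub for ContactFormer: label a vertex as 'in contact' when it
    stays near the floor plane across the sequence (input: T x V x 3).
    Averaging over time yields temporally consistent labels."""
    near_floor = np.abs(body_vertices[..., 2] - floor_height) < eps  # T x V
    return near_floor.mean(axis=0) > 0.5                             # V

def fit_object(contact_points, object_points, grid=np.linspace(-1, 1, 41)):
    """Toy stand-in for the plausibility optimization: search x/y
    translations of an object point cloud, minimizing the mean distance
    from each contact point to its nearest object point."""
    best, best_loss = None, np.inf
    for dx in grid:
        for dy in grid:
            placed = object_points + np.array([dx, dy, 0.0])
            d = np.linalg.norm(contact_points[:, None] - placed[None], axis=-1)
            loss = d.min(axis=1).mean()  # contacts should touch the object
            if loss < best_loss:
                best, best_loss = placed, loss
    return best, best_loss

# Toy data: 10 frames of 100 "body vertices"; eight rest on the floor.
rng = np.random.default_rng(0)
verts = rng.uniform(-1, 1, size=(10, 100, 3)) + np.array([0.0, 0.0, 1.0])
verts[:, :8, 2] = 0.02                    # persistent floor contact
labels = predict_contacts(verts)          # step 1: contact prediction
contacts = verts[0, labels]
placed, loss = fit_object(contacts, rng.uniform(-0.2, 0.2, (50, 3)))  # step 2
print(f"{labels.sum()} contact vertices, placement loss {loss:.3f}")
```

In the actual framework, the contact labels are predicted per frame by a learned model and carry semantic object categories, and the placement stage optimizes richer physical plausibility losses over object pose, as described above.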
Overview of SUMMON: (a) an input sequence of human body meshes interacting with a scene, (b) ContactFormer, which predicts per-frame contact labels, (c) the per-frame contact predictions, (d) the estimated contact points, (e) the synthesized objects, and (f) the objects in interaction.
By leveraging predicted semantic contact labels, SUMMON can synthesize diverse, plausible scenes from a single human motion sequence, and thus has the potential to generate extensive human-scene interaction data for the community.
SUMMON further completes the scene by sampling and placing objects that are not in contact with humans.
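As a companion to the contact-driven step, this completion step can be pictured with an equally hedged sketch: rejection-sample a free-space placement for a non-contact object so that it keeps a clearance from the area swept by the motion. The function name, the clearance heuristic, and the room bounds are all illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of scene completion (illustrative only): sample
# candidate positions for a non-contact object and accept one that keeps
# a clearance from every body position in the motion sequence.
import numpy as np

def place_free_object(body_xy, half_size, room=2.0, clearance=0.3,
                      n_tries=1000, seed=0):
    """body_xy: (N, 2) x/y positions swept by the body over all frames.
    Returns an (x, y) center at least clearance + half_size away from
    the motion, or None if no valid spot is found within n_tries."""
    rng = np.random.default_rng(seed)
    for _ in range(n_tries):
        xy = rng.uniform(-room, room, size=2)
        if np.linalg.norm(body_xy - xy, axis=1).min() > clearance + half_size:
            return xy
    return None

# Toy usage: a figure-eight walking path, then a 0.4 m-wide object.
t = np.linspace(0, 2 * np.pi, 200)
path = np.stack([np.sin(t), np.sin(2 * t)], axis=1)  # (200, 2) trajectory
print("placed at", place_free_object(path, half_size=0.2))
```

A rejection sampler is the simplest possible placement strategy; it only demonstrates the constraint that completion objects must not collide with the observed motion, not how SUMMON chooses which objects to sample.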
@inproceedings{10.1145/3550469.3555426,
author = {Ye, Sifan and Wang, Yixing and Li, Jiaman and Park, Dennis and Liu, C. Karen and Xu, Huazhe and Wu, Jiajun},
title = {Scene Synthesis from Human Motion},
year = {2022},
isbn = {9781450394703},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3550469.3555426},
doi = {10.1145/3550469.3555426},
booktitle = {SIGGRAPH Asia 2022 Conference Papers},
articleno = {26},
numpages = {9},
keywords = {Scene synthesis, activity understanding, motion analysis},
location = {Daegu, Republic of Korea},
series = {SA '22}
}
This work is in part supported by the Stanford Human-Centered AI Institute (HAI), the Toyota Research Institute (TRI), Innopeak, Meta, Bosch, and Samsung.
This website template was borrowed from HyperNeRF.