Naturally controllable human-scene interaction (HSI) generation plays an important role in various fields, such as VR/AR content creation and human-centered AI. However, existing methods offer only unnatural and unintuitive control, which heavily limits their practical application. We therefore focus on the challenging task of naturally and controllably generating realistic and diverse HSIs from textual descriptions. From the perspective of human cognition, an ideal generative model should reason correctly about both spatial relationships and interactive actions. To that end, we propose Narrator, a novel relationship reasoning-based generative approach that uses a conditional variational autoencoder for naturally controllable generation given a 3D scene and a textual description. We model global and local spatial relationships in the 3D scene and the textual description, respectively, based on scene graphs, and introduce a part-level action mechanism that represents interactions as atomic body part states. In particular, benefiting from this relationship reasoning, we further propose a simple yet effective multi-human generation strategy, the first exploration of controllable multi-human scene interaction generation. Extensive experiments and perceptual studies show that Narrator can controllably generate diverse interactions and significantly outperforms existing works. The code and dataset will be made available for research purposes.
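To make the part-level action mechanism more concrete, the following is a minimal Python sketch of what representing interactions as atomic body part states could look like. The part names, state labels, and verb table are illustrative assumptions for intuition only, not the paper's actual implementation.

# Hypothetical illustration of part-level actions: an interaction verb is
# decomposed into atomic states of individual body parts. The mappings
# below are assumed for illustration, not taken from the paper.
from typing import Dict

PART_LEVEL_ACTIONS: Dict[str, Dict[str, str]] = {
    # verb -> {body part: atomic state}
    "sit":   {"pelvis": "contact", "thighs": "contact", "feet": "ground"},
    "lie":   {"torso": "contact", "head": "contact", "legs": "contact"},
    "touch": {"right_hand": "contact"},
}

def compose(*verbs: str) -> Dict[str, str]:
    """Merge several verbs (e.g. 'sit' + 'touch') into one set of
    per-part states, which is what enables multi-action control."""
    states: Dict[str, str] = {}
    for v in verbs:
        states.update(PART_LEVEL_ACTIONS[v])
    return states

print(compose("sit", "touch"))
# {'pelvis': 'contact', 'thighs': 'contact', 'feet': 'ground', 'right_hand': 'contact'}

Under this view, a compound description such as "sit on the chair and touch the table" simply merges the per-part states of both verbs, rather than requiring a dedicated label for every verb combination.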
Teaser
Given a textual description, our approach naturally and controllably generates semantically consistent and physically plausible human-scene interactions for various cases: (a) interactions guided by spatial relationships, (b) interactions guided by multiple actions, (c) multi-human scene interactions, and (d) human-scene interactions combining the above interaction types, which cannot be generated by prior works.
Framework Overview
Overview of the proposed Narrator framework. Given a scene and a textual description, multi-modal features, including scene features, scene graph features, and action features, are extracted (a); the latter two are obtained through our Joint Global and Local Scene Graph reasoning (b) and Part-Level Action mechanism (c), respectively. These features are then concatenated into a joint conditional embedding and fed into a transformer-based cVAE to generate the human-scene interaction (d).
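For intuition, below is a minimal PyTorch sketch of such a conditional pipeline, assuming feature extractors upstream. The module names, dimensions, and encoder/decoder internals are hypothetical stand-ins, not the authors' released code.

# Sketch of a transformer-based cVAE conditioned on a joint embedding of
# scene, scene-graph, and action features, as in stages (a)-(d) above.
# All sizes and layer choices are assumptions for illustration.
import torch
import torch.nn as nn

class NarratorLikeCVAE(nn.Module):
    def __init__(self, scene_dim=1024, graph_dim=300, action_dim=300,
                 body_dim=72, d_model=256, latent_dim=64):
        super().__init__()
        # (a)-(c): per-modality projections (feature extraction assumed upstream)
        self.scene_proj = nn.Linear(scene_dim, d_model)
        self.graph_proj = nn.Linear(graph_dim, d_model)    # scene-graph features (b)
        self.action_proj = nn.Linear(action_dim, d_model)  # part-level action features (c)
        self.body_proj = nn.Linear(body_dim, d_model)
        # (d): transformer-based cVAE
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, body_dim))  # e.g. body pose/translation parameters

    def condition(self, scene_feat, graph_feat, action_feat):
        # Fuse the three modalities into the joint conditional embedding.
        tokens = torch.stack([self.scene_proj(scene_feat),
                              self.graph_proj(graph_feat),
                              self.action_proj(action_feat)], dim=1)
        return self.encoder(tokens).mean(dim=1)  # (B, d_model)

    def forward(self, scene_feat, graph_feat, action_feat, body):
        cond = self.condition(scene_feat, graph_feat, action_feat)
        # Posterior q(z | body, condition) for training.
        tokens = torch.stack([self.body_proj(body), cond], dim=1)
        h = self.encoder(tokens).mean(dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.decoder(torch.cat([z, cond], dim=-1)), mu, logvar

    @torch.no_grad()
    def sample(self, scene_feat, graph_feat, action_feat):
        # At test time, sample z from the standard-normal prior and decode
        # conditioned on the joint embedding.
        cond = self.condition(scene_feat, graph_feat, action_feat)
        z = torch.randn(cond.size(0), self.to_mu.out_features, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))

Sampling different z under the same condition is what yields diverse interactions for a single scene and description.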
Results
Qualitative comparison of interactions generated by our approach and three baselines. Columns show different textual queries and rows show different methods. Overall, our generated interactions are more semantically consistent with the textual descriptions and more physically realistic in their scene interactions.
Citation
@misc{xuan2023narrator,
  title={Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning},
  author={Haibiao Xuan and Xiongzheng Li and Jinsong Zhang and Hongwen Zhang and Yebin Liu and Kun Li},
  year={2023},
  eprint={2303.09410},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}