Recent generative models can synthesize high-quality images but often fail to generate humans interacting with objects using their hands. This stems largely from the models' limited understanding of such interactions and the difficulty of synthesizing intricate regions of the human body.
In this paper, we propose GraspDiffusion, a novel generative method that creates realistic scenes of human-object interaction. Given a 3D object mesh, GraspDiffusion first constructs life-like whole-body poses with control over the object's location relative to the human body. This is achieved by separately leveraging the generative priors for 3D body poses and hand poses, optimizing them into a joint grasping pose. The resulting pose guides the image synthesis to correctly reflect the intended interaction, allowing the creation of realistic and diverse human-object interaction scenes.
We demonstrate that GraspDiffusion can successfully tackle the relatively underexplored problem of generating full-bodied human-object interactions, while also outperforming previous methods.
Starting with a 3D object mesh and its position within the human-centric coordinate system (originating at the pelvis joint), GraspDiffusion synthesizes realistic images portraying a human interacting with the object, with a significant portion of the human body visible.
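To make this input convention concrete, the minimal sketch below (not from the paper's codebase; the function name and the identity body rotation are illustrative assumptions) expresses an object mesh's vertices in a pelvis-centered, human-centric frame:

```python
# Minimal sketch: mapping object mesh vertices from world coordinates into a
# human-centric frame whose origin is the pelvis joint. Names and the default
# identity rotation are illustrative assumptions, not the paper's code.
import numpy as np

def to_pelvis_frame(object_vertices, pelvis_position, body_rotation=None):
    """Translate vertices so the pelvis becomes the origin, then rotate them
    into the body-aligned frame (identity rotation if none is given)."""
    if body_rotation is None:
        body_rotation = np.eye(3)  # assume the body frame is axis-aligned with the world
    return (object_vertices - pelvis_position) @ body_rotation.T

# Toy usage: random object vertices in world coordinates, pelvis at waist height.
verts_world = np.random.rand(8, 3) + np.array([0.3, 0.9, 0.4])
pelvis_world = np.array([0.0, 0.9, 0.0])
verts_pelvis = to_pelvis_frame(verts_world, pelvis_world)
```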
Compared to similar research, our method focuses on grasping objects positioned at diverse locations relative to the human, capturing the wide range of possible grasping poses. We separately generate the hand pose and the body pose, which are jointly optimized to create a full-bodied 3D human model grasping an object.
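As a rough illustration of this idea, the sketch below jointly refines separately initialized body and hand pose parameters so that the hand reaches the object while both poses stay close to their sampled priors. The placeholder forward kinematics, loss terms, and weights are assumptions for illustration, not the paper's actual objective:

```python
# Illustrative sketch only: jointly refining separately generated body and hand
# poses into a single grasping pose. A real implementation would use a parametric
# body model (e.g. SMPL-X) for forward kinematics and richer contact/penetration losses.
import torch

def wrist_position(body_pose):
    """Stand-in for forward kinematics: pretend the first 3 parameters
    encode the wrist position relative to the pelvis."""
    return body_pose[:3]

body_pose = torch.zeros(63, requires_grad=True)   # body pose parameters (from a body-pose prior)
hand_pose = torch.zeros(45, requires_grad=True)   # hand pose parameters (from a grasp prior)
body_init, hand_init = body_pose.detach().clone(), hand_pose.detach().clone()
object_center = torch.tensor([0.3, 0.1, 0.4])     # target object location relative to the pelvis

optimizer = torch.optim.Adam([body_pose, hand_pose], lr=1e-2)
for _ in range(200):
    optimizer.zero_grad()
    reach = torch.norm(wrist_position(body_pose) - object_center)  # hand should reach the object
    stay_body = torch.norm(body_pose - body_init)                  # stay near the sampled body pose
    stay_hand = torch.norm(hand_pose - hand_init)                  # stay near the sampled grasp pose
    loss = reach + 0.1 * stay_body + 0.1 * stay_hand
    loss.backward()
    optimizer.step()
```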
From the generated 3D grasping pose, we extract multiple geometric cues that serve as guidance for creating a detailed, realistic scene of a human interacting with the given object. We use a series of spatial encoders and an attention-injection scheme to facilitate plausible interaction.
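The following conceptual sketch shows one way tokens from a spatial encoder could be injected into an attention layer so that image features attend to rendered geometric guidance. The layer sizes, the single-layer encoder, and the injection point are assumptions, not the paper's actual architecture:

```python
# Conceptual sketch (assumptions throughout): injecting encoded guidance tokens
# as extra keys/values in an attention layer, so image tokens can attend to
# geometric cues such as rendered pose or hand-contact maps.
import torch
import torch.nn as nn

class GuidanceInjectedAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.encode_guidance = nn.Linear(dim, dim)  # stand-in for a spatial encoder
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens, guidance_tokens):
        # Append encoded guidance tokens to the attention context.
        context = torch.cat([image_tokens, self.encode_guidance(guidance_tokens)], dim=1)
        out, _ = self.attn(query=image_tokens, key=context, value=context)
        return image_tokens + out  # residual update of the image features

# Toy usage: 256 image tokens attend to 64 guidance tokens.
layer = GuidanceInjectedAttention()
x = torch.randn(1, 256, 64)
g = torch.randn(1, 64, 64)
y = layer(x, g)  # shape (1, 256, 64)
```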
To address the shortage of realistic images paired with 3D annotations, we design an annotation pipeline that repurposes previous interaction datasets into a pseudo-3D interaction dataset.
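For illustration, a pseudo-3D annotation entry might pair a real interaction image with fitted pose parameters and the object geometry; the field names below are hypothetical and do not reflect the paper's actual schema:

```python
# Hypothetical sketch of a pseudo-3D interaction annotation record.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PseudoInteractionAnnotation:
    image_path: str           # real interaction image from an existing dataset
    body_pose: List[float]    # fitted whole-body pose parameters
    hand_pose: List[float]    # fitted hand pose parameters
    object_mesh_path: str     # 3D mesh of the grasped object
    object_translation: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])  # pelvis-relative
```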
Requiring only minimal inputs (an object mesh and its relative location), GraspDiffusion can generate a wide range of plausible 3D grasping poses and realistic human-object interaction images.
We compare human-object interaction images generated by different methods using the same input object. While other methods display erroneous interactions (e.g., multiple objects, distorted object appearance, color blending), our pipeline correctly conveys the intended human-object grasp.
@InProceedings{Kwon_2026_WACV,
    author    = {Kwon, Patrick and Chen, Chen and Joo, Hanbyul},
    title     = {GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
}