GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction

University of Central Florida, Seoul National University

Given an object mesh and its relative position, GraspDiffusion generates a realistic human-object interaction scene.

Abstract

Recent generative models can synthesize high-quality images but often fail to generate humans interacting with objects using their hands. This stems largely from the models' limited understanding of such interactions and from the difficulty of synthesizing intricate regions of the human body.

In this paper, we propose GraspDiffusion, a novel generative method that creates realistic scenes of human-object interaction. Given a 3D object mesh, GraspDiffusion first constructs life-like whole-body poses with control over the object's location relative to the human body. This is achieved by separately leveraging the generative priors for 3D body poses and hand poses, optimizing them into a joint grasping pose. The resulting pose guides the image synthesis to correctly reflect the intended interaction, allowing the creation of realistic and diverse human-object interaction scenes.

We demonstrate that GraspDiffusion successfully tackles the relatively underexplored problem of generating whole-body human-object interactions, while also outperforming previous methods.

Approach

Starting with a 3D object mesh and its position within the human-centric coordinate system (originating at the pelvis joint), GraspDiffusion synthesizes realistic images portraying a human interacting with the object, with a significant portion of the human body visible.
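To make this input convention concrete, here is a minimal sketch in Python (using numpy and trimesh; the function name and the centroid-centering step are our own assumptions, not the authors' released code) that places an object mesh in the pelvis-centered coordinate system:

import numpy as np
import trimesh

def place_object_in_pelvis_frame(mesh_path, offset_xyz):
    """Express an object mesh in the human-centric frame.

    The frame originates at the pelvis joint, so offset_xyz is the
    desired object position relative to the pelvis.
    """
    mesh = trimesh.load(mesh_path, force="mesh")
    # Center the mesh at its own centroid so the offset is unambiguous.
    mesh.apply_translation(-mesh.centroid)
    # Translate to the requested position relative to the pelvis origin.
    mesh.apply_translation(np.asarray(offset_xyz, dtype=float))
    return mesh  # vertices now live in pelvis-centered coordinates

For example, place_object_in_pelvis_frame("mug.obj", (0.3, 0.1, 0.4)) would place a mug in front of and slightly above the pelvis, ready to be handed to the grasp generation stage (the file name and offset are purely illustrative).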


Full-body Grasping

Compared to similar research, our method focuses on grasping objects positioned at diverse locations relative to the human body, capturing the wide range of possible grasping poses. We generate the hand pose and the body pose separately, then jointly optimize them into a full-body 3D human model grasping the object.
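A minimal, runnable sketch of this joint optimization is given below (PyTorch; the quadratic stand-in losses, pose dimensions, and hyperparameters are our own assumptions, whereas the paper's actual objectives would involve the body model's kinematics, hand-object contact, and interpenetration terms):

import torch

def prior_term(pose, pose_init):
    # Stay close to the pose sampled from the generative prior.
    return ((pose - pose_init) ** 2).mean()

def attachment_term(body_pose, hand_pose):
    # Placeholder for consistency between body and hand (e.g. at the wrist).
    return ((body_pose[:3] - hand_pose[:3]) ** 2).sum()

def optimize_grasp(body_init, hand_init, steps=200, lr=1e-2):
    body = body_init.clone().requires_grad_(True)
    hand = hand_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([body, hand], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (prior_term(body, body_init) + prior_term(hand, hand_init)
                + attachment_term(body, hand))
        loss.backward()
        opt.step()
    return body.detach(), hand.detach()

# Poses sampled separately from the body prior and the hand prior.
body0, hand0 = torch.randn(63), torch.randn(45)
body, hand = optimize_grasp(body0, hand0)

The point this illustrates is the design choice: each pose stays anchored to its own generative prior while a coupling term stitches the two into a single coherent grasp.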


Scene Generation

From the generated 3D grasping pose, we extract multiple geometric cues that serve as guidance for creating a detailed, realistic scene of a human interacting with the given object. We use a series of spatial encoders and an attention-injection scheme to steer the synthesis toward plausible interaction.
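One way to picture the attention injection is sketched below (PyTorch; the dimensions, the single attention layer, and the concatenation strategy are illustrative assumptions rather than the paper's exact architecture). Tokens produced by a spatial encoder from the geometric guidance are appended to the keys and values, letting image tokens attend to the interaction geometry:

import torch
import torch.nn as nn

class InjectedAttention(nn.Module):
    # Hypothetical sketch: guidance tokens from a spatial encoder are
    # concatenated into the keys/values of a denoiser attention layer.
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens, guidance_tokens):
        # image_tokens: (B, N, dim); guidance_tokens: (B, M, dim)
        context = torch.cat([image_tokens, guidance_tokens], dim=1)
        out, _ = self.attn(image_tokens, context, context)
        return out

x = torch.randn(1, 64, 320)  # latent image tokens
g = torch.randn(1, 16, 320)  # tokens encoding the geometric guidance
y = InjectedAttention()(x, g)  # (1, 64, 320)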


Datasets

To address the shortage of realistic images paired with 3D annotations, we design an annotation pipeline that leverages previous interaction datasets as a pseudo-3D interaction dataset.
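A skeleton of such an annotation pipeline might look like the following (Python; every helper below is a stub standing in for an off-the-shelf component, and none of the names come from the authors' code):

from dataclasses import dataclass

@dataclass
class PseudoAnnotation:
    image_path: str
    body_pose: list        # whole-body pose fitted to the image
    hand_pose: list        # hand pose consistent with the grasp
    object_offset: tuple   # object location in the pelvis-centered frame

def fit_body_pose(image_path):
    # Stub: in practice, an off-the-shelf whole-body pose estimator.
    return [0.0] * 63

def fit_hand_pose(image_path, body_pose):
    # Stub: in practice, a hand pose estimate refined around the object.
    return [0.0] * 45

def locate_object(image_path, body_pose):
    # Stub: in practice, derived from the source dataset's annotations.
    return (0.0, 0.0, 0.0)

def build_pseudo_dataset(image_paths):
    annotations = []
    for path in image_paths:
        body = fit_body_pose(path)
        hand = fit_hand_pose(path, body)
        obj = locate_object(path, body)
        annotations.append(PseudoAnnotation(path, body, hand, obj))
    return annotations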


Realistic Interaction Generation

Requiring only minimal inputs (an object mesh and its location relative to the body), GraspDiffusion can generate a wide range of plausible 3D grasping poses and realistic human-object interaction images.


We compare human-object interaction images generated by different methods from the same input object. While other methods display erroneous interactions (e.g., duplicated objects, distorted object appearance, color blending), our pipeline correctly conveys the intended human-object grasp.


Additional Samples


Initialized from an Object Image


Multiple Style Support


BibTeX

@InProceedings{Kwon_2026_WACV,
    author    = {Kwon, Patrick and Chen, Chen and Joo, Hanbyul},
    title     = {GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
}