TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

1SSE, CUHKSZ    2FNii, CUHKSZ
*Indicates Equal Contribution
§Corresponding Author
Work done as a visiting researcher at CUHKSZ

This paper solely reflects the author's personal research and is not associated with the author's affiliated institution.

CVPR 2025


arXiv    Code (coming soon)    Dataset (coming soon)

TASTE-Rob: We present TASTE-Rob, which generates plausible HOI videos in unseen scenes and paves the way toward generalizable robotic manipulation.

Abstract

We address key limitations in existing datasets and models for task-oriented hand-object interaction (HOI) video generation, a critical approach for producing video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability to precise imitation learning tasks. To address this, we introduce TASTE-Rob, a pioneering large-scale dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with a language instruction and recorded from a consistent camera viewpoint to ensure interaction clarity. Fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob yields realistic object interactions, though we observed occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand-posture accuracy in generated videos. Together, our curated dataset and the specialized pose-refinement framework yield notable gains in generating high-quality, task-oriented hand-object interaction videos, enabling superior generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advancements in the field.
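To make the recipe in the abstract concrete, below is a minimal, hypothetical PyTorch sketch. It is not the authors' implementation: ToyVDM, the linear noising schedule, and the temporal-smoothing stand-in for pose refinement are all placeholder assumptions; the paper's actual system is a fine-tuned Video Diffusion Model followed by a learned three-stage pose-refinement pipeline.

    # Hypothetical sketch of the workflow described above (NOT the authors' code).
    # ToyVDM and the smoothing step are placeholder stand-ins.
    import torch
    import torch.nn as nn

    class ToyVDM(nn.Module):
        """Stand-in for an instruction-conditioned video diffusion model."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def forward(self, noisy_latent, text_emb):
            # Predict the noise added to the video latent, conditioned on text.
            return self.net(torch.cat([noisy_latent, text_emb], dim=-1))

    model = ToyVDM()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Step 1: fine-tune the VDM on (instruction, ego-centric HOI video) pairs.
    for step in range(100):
        latent = torch.randn(8, 64)           # placeholder encoded video latents
        text = torch.randn(8, 64)             # placeholder instruction embeddings
        noise = torch.randn_like(latent)
        t = torch.rand(8, 1)                  # diffusion time in [0, 1]
        noisy = (1 - t) * latent + t * noise  # toy linear noising schedule
        loss = ((model(noisy, text) - noise) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Steps 2-3 (schematic): the paper refines hand postures in the generated
    # video. As a crude stand-in, temporally smooth an estimated pose sequence.
    def smooth_poses(poses: torch.Tensor, k: int = 5) -> torch.Tensor:
        """Moving-average smoothing over time; poses is (T, pose_dim)."""
        kernel = torch.ones(1, 1, k) / k
        x = poses.t().unsqueeze(1)            # (pose_dim, 1, T)
        return nn.functional.conv1d(x, kernel, padding=k // 2).squeeze(1).t()

    poses = torch.randn(16, 48)               # e.g. 16 frames of hand-pose params
    refined = smooth_poses(poses)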

Dataset

TASTE-Rob comprises 100,856 ego-centric hand-object interaction videos spanning diverse actions, scenes, and objects, making it the first video dataset specifically designed for task-oriented HOI video generation and robotic imitation learning. The dataset will be released soon!
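For planning purposes, here is a hedged sketch of how one might iterate over such a dataset. The file layout (one MP4 per clip plus a JSON index mapping clip IDs to language instructions) is purely an assumption, since the release format has not been announced.

    # Hypothetical loading sketch; the real release format may differ.
    # Assumes one MP4 per clip and a JSON index {clip_id: instruction}.
    import json
    from pathlib import Path

    root = Path("TASTE-Rob")  # assumed dataset root directory
    index = json.loads((root / "instructions.json").read_text())
    for clip_id, instruction in list(index.items())[:3]:
        video_path = root / "videos" / f"{clip_id}.mp4"
        print(f"{video_path}: {instruction}")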

BibTeX


        @inproceedings{zhao2025tasterobadvancingvideogeneration,
          title     = {TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation},
          author    = {Hongxiang Zhao and Xingchen Liu and Mutian Xu and Yiming Hao and Weikai Chen and Xiaoguang Han},
          booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
          year      = {2025},
        }