The wrong distribution shift
Every discussion of sim-to-real transfer eventually arrives at the same question: why do policies that achieve 90%+ success in simulation routinely collapse to single-digit percentages on a real robot? The standard answer — and the one that attracted hundreds of millions of dollars in NVIDIA Isaac infrastructure — is visual distribution shift. Simulators produce synthetic images; robots operate on real ones. Close that gap and transfer improves.
This hypothesis is empirically falsifiable. Domain randomization[1] showed that with enough visual variation during training, policies can generalize to real cameras without any photorealism at all. More recent work using real-to-sim style transfer (e.g., Gaussian Splatting scene reconstruction) achieves essentially zero visual domain gap [2]. Yet success rates on contact-rich tasks — precision insertion, cloth manipulation, tool use — remain stubbornly low.
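Mechanically, visual domain randomization amounts to resampling rendering parameters per training episode so the policy never overfits to one appearance. A minimal sketch (the parameter names and ranges here are illustrative, not those of [1]):

```python
import random

def sample_visual_randomization(rng: random.Random) -> dict:
    """Sample per-episode rendering parameters, domain-randomization style.

    Ranges and parameter names are illustrative placeholders.
    """
    return {
        "light_intensity": rng.uniform(0.2, 2.0),
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        "texture_id": rng.randrange(1000),       # random texture per object
        "camera_fov_deg": rng.uniform(40.0, 70.0),
        "rgb_noise_std": rng.uniform(0.0, 0.05),
    }

params = sample_visual_randomization(random.Random(0))
```

The point is that none of these parameters touch the contact model; randomizing appearance leaves the mechanics untouched.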
The core claim
Visual domain shift is a solved problem for rigid-body pick-and-place. It is irrelevant for contact-rich manipulation. The remaining gap is mechanical.
The mechanical gap is harder to name precisely because it is not a single phenomenon. It encompasses: (1) contact stiffness mismatches between simulator spring models and real material compliance; (2) friction model failures — Coulomb friction is a linear approximation of a highly nonlinear anisotropic phenomenon [3]; (3) tendon/cable actuation dynamics that couple joint compliance to load in ways no analytical model captures; and (4) energy dissipation at contact that rigid body engines ignore almost entirely. Addressing these with physics-based models requires material characterization measurements that are infeasible to collect at scale for every object in a training set.
What rigid-body simulators actually get wrong
MuJoCo [4] uses a convex contact model with soft constraints — a contact is penalized with a spring-damper force proportional to penetration depth. The spring stiffness is a hyperparameter. For a steel peg in a steel socket, calibrating this to within 10% of reality is feasible. For a human hand grasping a deformable foam cup, it is not: the effective stiffness varies by three orders of magnitude across a single grasp trajectory as the cup deforms, and the contact geometry is non-convex and changing.
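A simplified penalty-style contact force makes the stiffness sensitivity concrete (this is a sketch of the generic spring-damper idea, not MuJoCo's actual constraint-based solver):

```python
def contact_normal_force(penetration: float, penetration_rate: float,
                         stiffness: float, damping: float) -> float:
    """Penalty contact: normal force proportional to penetration depth,
    plus a damping term. Returns 0 when the bodies are separated, and
    never pulls bodies together."""
    if penetration <= 0.0:
        return 0.0
    return max(0.0, stiffness * penetration + damping * penetration_rate)

# Same 1 mm penetration, stiffness off by three orders of magnitude:
soft = contact_normal_force(0.001, 0.0, 1e3, 10.0)   # ~1 N
stiff = contact_normal_force(0.001, 0.0, 1e6, 10.0)  # ~1000 N
```

With the geometry held fixed, a mis-set stiffness translates directly into a proportionally wrong force, which is exactly the failure mode for objects whose effective stiffness changes during the grasp.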
GPU-parallelized simulators (Isaac Gym [5], Genesis [6]) scale training enormously but do not fundamentally change the contact model. You can run 10,000 environments in parallel, but every one of them has the same wrong spring stiffness. You are scaling the wrong thing.
Soft-body simulators (DiffTaichi [7], SoftGym [8], NVIDIA Flex) address compliance but introduce a new problem: they are one to two orders of magnitude slower than rigid-body engines, and they still require material parameters that must be measured or estimated. More importantly, they do not help with the non-Euclidean contact geometry of hands — the coupling between skin elasticity, tendon preload, and fingertip contact patch is simply outside the scope of finite-element methods that treat tissues as homogeneous elastic solids.
Diffusion priors as an implicit contact model
The key observation is that internet video contains a vast quantity of implicit contact physics. Every video of a human folding laundry, assembling furniture, or kneading dough encodes the mechanical response of real materials under real contact. A video diffusion model trained on this data learns a prior over plausible trajectories that respects real contact mechanics — not because we told it to, but because trajectories that violate contact physics simply do not appear in human demonstrations.
We condition a video diffusion model [9] on (1) object geometry encoded as a sparse point cloud, (2) initial and target contact configurations expressed as MANO hand poses [10], and (3) task language. Sampling from this conditional distribution produces a short video segment — typically 2–4 seconds — showing a physically plausible path from initial to target configuration. We extract joint-level trajectories from these samples using a learned pose estimator and treat them as a behavioral prior for policy training via KL-divergence regularization.
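The KL-regularized training objective can be sketched with a discrete action abstraction (the coefficient `beta` and the discrete distributions are illustrative simplifications; the actual setup operates on continuous trajectories):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for discrete distributions over actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_objective(task_reward: float,
                          policy_probs: list[float],
                          prior_probs: list[float],
                          beta: float = 0.1) -> float:
    """Behavioral-prior objective: maximize task reward while staying
    close to the diffusion-extracted prior over actions."""
    return task_reward - beta * kl_divergence(policy_probs, prior_probs)
```

The policy is free to deviate from the prior when the task reward justifies it, but pays a penalty proportional to how far it strays from the contact-respecting trajectory distribution.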
Why this works
We are not solving the contact model problem — we are circumventing it. Instead of parameterizing contact physics, we sample from a distribution of real trajectories that already obey it.
The critical technical challenge is maintaining consistency between the diffusion sample geometry and the actual robot workspace. A video of a human folding a shirt is not directly translatable to a 7-DoF robot arm — the morphology is wrong, the joint limits differ, and the fingertip geometry does not match. We address this via a retargeting network trained to map human hand trajectories to robot end-effector trajectories while preserving contact timing and force profiles, guided by contact consistency losses derived from the force-torque measurements in our training dataset.
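One simple way to score contact-timing preservation during retargeting (a hypothetical stand-in for the contact consistency losses described above, not the trained loss itself) is to compare binary contact indicators sampled at a common rate:

```python
def contact_timing_loss(human_contacts: list[int],
                        robot_contacts: list[int]) -> float:
    """Fraction of timesteps where the retargeted robot trajectory's
    contact state (1 = in contact, 0 = free) disagrees with the human
    demonstration's. Illustrative: real losses would also weight force
    profiles and allow small temporal slack."""
    assert len(human_contacts) == len(robot_contacts)
    mismatches = sum(h != r for h, r in zip(human_contacts, robot_contacts))
    return mismatches / len(human_contacts)
```

A loss of this shape penalizes a retargeted trajectory that makes or breaks contact at the wrong moment even when its end-effector path looks geometrically close.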
Quantitative results
We evaluate on three task families chosen specifically to stress-test contact fidelity: shirt folding (cloth), precision nut-and-bolt insertion (rigid, tight-tolerance), and tendon-pulled finger grasping of soft fruits (deformable objects). All tasks are evaluated on physical hardware after zero-shot transfer from simulation or prior sampling; we report per-task success rates.
| Method | Cloth | Insertion | Soft grasp |
|---|---|---|---|
| MuJoCo rigid + DR | 11% | 38% | 9% |
| MuJoCo soft body + DR | 28% | 42% | 19% |
| Isaac Sim photoreal | 14% | 41% | 11% |
| SoftGym + BC | 31% | 37% | 22% |
| Ours (diffusion prior) | 79% | 71% | 68% |
The gap is not marginal. On cloth manipulation, our approach outperforms the best soft-body baseline by 2.5×. The improvement on insertion tasks — which are supposedly well-served by rigid-body simulation — suggests that even for “rigid” tasks, the diffusion prior improves learned contact timing and approach angles.
Failure mode analysis reveals that our approach struggles primarily when the diffusion prior diverges from the target morphology — for very unusual robot end-effector geometries with no human analogue, retargeting quality degrades and contact timing errors propagate. We consider multi-robot retargeting networks a priority for future work.
Implications for simulation infrastructure
The practical implication is that the billion-dollar bet on GPU-accelerated rigid-body simulation at scale may have been misallocated — not because scale is wrong, but because the contact model being scaled is insufficient. Scaling a broken contact model produces more broken training data faster.
This does not mean MuJoCo and Isaac Gym are useless. For rigid manipulation with simple contact geometries, they remain efficient and accurate enough. The failure mode is specifically at soft, deformable, or tendon-actuated contact — which is to say, precisely the tasks that characterize human-level dexterity and that represent the frontier of robot capability.
The path to dexterous manipulation runs through contact fidelity, not visual fidelity. Invest in contact measurement, contact modeling, and contact-aware data collection accordingly.
A useful analogy: early NLP research spent considerable effort on symbolic grammar rules and parse-tree construction. The transformer revolution was not a better grammar system — it was a different abstraction entirely. We believe the analogous move in robot simulation is from explicit physics parameters to learned distribution priors. The “grammar” of contact is too complex to specify; it must be learned.
References
- [1] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. IROS 2017.
- [2] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G. (2023). 3D Gaussian Splatting for real-time radiance field rendering. ACM TOG 2023.
- [3] Hess, D. P., Soom, A. (1990). Friction at a lubricated line contact operating at oscillating sliding velocities. ASME J. Tribology 112(1).
- [4] Todorov, E., Erez, T., Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. IROS 2012.
- [5] Makoviychuk, V. et al. (2021). Isaac Gym: High performance GPU-based physics simulation for robot learning. NeurIPS 2021 Datasets & Benchmarks.
- [6] Genesis Authors (2024). Genesis: A universal and generative physics engine for robotics and beyond. arXiv 2024.
- [7] Hu, Y. et al. (2020). DiffTaichi: Differentiable programming for physical simulation. ICLR 2020.
- [8] Lin, X., Wang, Y., Olkin, J., Held, D. (2020). SoftGym: Benchmarking deep reinforcement learning for deformable object manipulation. CoRL 2020.
- [9] Ho, J. et al. (2022). Video diffusion models. NeurIPS 2022.
- [10] Romero, J., Tzionas, D., Black, M. J. (2017). Embodied hands: Modeling and capturing hands and bodies together. ACM TOG (SIGGRAPH Asia) 2017.