Research
Perception · 11 min read

MANO at scale: from monocular video to dense hand pose

TL;DR

Not all pose errors are equal. Errors during contact events corrupt rewards catastrophically; errors during free-space motion are largely irrelevant. Concentrate your labeling budget at contact moments.

01

The MANO model and its limits

MANO [1] (hand Model with Articulated and Non-rigid defOrmations) parameterizes a hand as a low-dimensional manifold in pose space: 45 degrees of freedom for joint angles (15 joints, each with 3 rotation parameters) and 10 shape parameters derived from PCA of roughly 1,000 high-resolution hand scans of 31 subjects. The model outputs a mesh of 778 vertices. Its compactness is a practical asset: you can represent an entire grasping trajectory as a sequence of 55-dimensional vectors. But it encodes a structural assumption: that all hands are smooth interpolations of a small scan database. This fails quietly for unusual finger proportions, for gloved hands, and for extreme joint configurations outside the training distribution.
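To make that compactness concrete, here is a minimal sketch of packing per-frame pose and shape parameters into a single flat vector. The layout and helper names are illustrative only, not the actual MANO API:

```python
import numpy as np

# Illustrative flat layout for one frame of a grasping trajectory:
# 45 joint-angle parameters (15 joints x 3 rotation components)
# followed by 10 PCA shape coefficients -> 55 values total.
N_POSE, N_SHAPE = 45, 10

def pack(pose: np.ndarray, shape: np.ndarray) -> np.ndarray:
    """Flatten pose (15, 3) and shape (10,) into one 55-d vector."""
    assert pose.shape == (15, 3) and shape.shape == (N_SHAPE,)
    return np.concatenate([pose.ravel(), shape])

def unpack(vec: np.ndarray):
    """Inverse of pack()."""
    assert vec.shape == (N_POSE + N_SHAPE,)
    return vec[:N_POSE].reshape(15, 3), vec[N_POSE:]

# A one-second trajectory at 30fps is just a (30, 55) array.
traj = np.stack([pack(np.zeros((15, 3)), np.zeros(N_SHAPE)) for _ in range(30)])
print(traj.shape)  # (30, 55)
```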

State-of-the-art MANO regression networks (HaMeR [2], METRO [3], FrankMocap [4]) achieve 11–35mm mean per-joint position error (MPJPE) on FreiHAND [5], the community's standard benchmark. This sounds acceptable until you consider what a 20mm error means in context: it is the width of a finger. A policy trained on demonstrations where fingertip positions are off by one finger width will learn systematically wrong contact targets.

  Method                    MPJPE (↓)   Task success (↑)
  FrankMocap (2021)         35mm        28%
  METRO (2021)              15mm        44%
  HaMeR (2023)              11mm        51%
  Ours (contact-selective)  18mm        79%

Fig. 1 — FreiHAND benchmark MPJPE vs. performance on robot imitation tasks. Note: benchmark accuracy does not predict downstream task performance; the correlation breaks down below ~30mm MPJPE because what matters is not average error but error at contact events.
02

Occlusion as the central problem

FreiHAND is a benchmark of hands photographed in isolation against controlled backgrounds. Real manipulation video is different in one critical way: the hand is always interacting with something. During a grasp, between 40% and 75% of the hand surface is occluded by the object being grasped [6]. The visible portion is systematically biased toward the dorsal (back) surface, which is the least information-rich for inferring fingertip and palm contact positions.

We measured occlusion rates across 12,000 manipulation episodes in our collection. The distribution is bimodal: either the hand is mostly visible (approaching the object) or mostly occluded (during contact). The transition between these states is fast — typically 2–4 frames at 30fps. This creates a fundamental challenge for temporal pose estimators that rely on temporal smoothness priors: they smooth through the contact event, interpolating between pre-contact and post-contact poses, and completely miss the contact configuration itself.

The cascade problem

A 2cm fingertip position error propagates to a 2cm IK target error, which propagates to a 2cm contact surface error in reward computation. Whether the gripper actually touched the object or missed it by 2cm is a binary reward difference. The error is not small — it is catastrophic.

The standard mitigation is temporal smoothing — Kalman filtering or learned temporal models that enforce trajectory consistency. This helps for free-space motion but actively hurts at contact, because contact events are genuinely discontinuous in configuration space: contact kinematics involve inequality constraints that are not well-approximated by smooth interpolation.
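The failure mode is easy to reproduce on a toy signal. Below, a fingertip descends toward a surface and stops at contact; a moving-average smoother (standing in for any temporal-smoothness prior) rounds off the corner, so the estimate at the contact frame floats above the surface:

```python
import numpy as np

# Toy fingertip height (cm): linear descent, then hard contact at frame 15.
t = np.arange(30)
z = np.where(t < 15, 15.0 - t, 0.0)

# 5-frame moving average as a stand-in for a temporal smoothness prior.
k = 5
z_smooth = np.convolve(z, np.ones(k) / k, mode="same")

# At the contact frame the smoothed estimate hovers above the surface:
# the kink at contact has been interpolated away.
print(z[15], z_smooth[15])  # 0.0 0.6
```

A 6mm error appears exactly at the frame that matters most, even though the smoother reduces average error elsewhere on the trajectory.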

03

The reward corruption phase transition

We conducted an ablation study in which we added calibrated Gaussian noise to ground-truth MANO poses obtained from motion capture and measured the resulting reward signal accuracy for three task types: precision pinch grasp, power grasp, and bimanual handoff. The results revealed a non-linear relationship between pose error and reward corruption that we call the phase transition.

For errors below approximately 8mm MPJPE, reward corruption is below 5%: most near-miss grasps are still close enough to register as contact. Above 15mm, reward corruption climbs steeply; errors of 20mm cause 40–60% reward corruption for precision tasks. But the critical finding is the shape of this curve. Between 8mm and 15mm there is a plateau, a region where better pose estimation barely helps. This means the difference between a 10mm system and a 14mm system is negligible for policy training, even though the benchmark difference seems significant.
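The mechanism behind the non-linearity can be illustrated with a toy model: a binary contact reward plus isotropic Gaussian pose noise. The threshold below is illustrative and the resulting numbers do not reproduce our measured curve; the point is only that thresholding a noisy distance produces a steeply non-linear corruption profile:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reward model: the true fingertip sits exactly on the object surface,
# and the binary reward registers "contact" when the *estimated* tip is
# within CONTACT_THRESH_MM of the surface. Gaussian pose noise can flip it.
CONTACT_THRESH_MM = 10.0
N = 100_000

def reward_corruption(sigma_mm: float) -> float:
    noise = rng.normal(0.0, sigma_mm, size=(N, 3))
    dist = np.linalg.norm(noise, axis=1)
    # Fraction of frames whose binary reward flips from 1 to 0.
    return float(np.mean(dist > CONTACT_THRESH_MM))

for sigma in (2.0, 8.0, 15.0, 20.0):
    print(f"{sigma:4.1f}mm noise -> {reward_corruption(sigma):5.1%} corrupted")
```

Small noise almost never crosses the threshold; past a certain scale, corruption rises rapidly toward saturation.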

This partially explains why METRO’s significant benchmark improvement over FrankMocap did not translate to downstream task improvement in our evaluation: both systems operate in the plateau region. To escape the plateau, you need either sub-8mm accuracy (very expensive) or you need to ensure that errors at contact events specifically are below 8mm — which is what selective refinement achieves.

04

Selective refinement: spending budget where it matters

Our approach uses a two-stage pipeline. First, a fast regressor (HaMeR, running in real time) produces coarse MANO estimates for all frames. Second, a contact event detector, trained to identify frames within 0.5 seconds of a grasp initiation or release event, flags a subset of frames (typically 15–25% of a trajectory) for high-precision refinement.

High-precision refinement uses multi-view optimization [7] when multiple cameras are available, and a slower optimization-based MANO fitter when only monocular video is available. The key innovation is the contact consistency constraint: we require that the MANO mesh surface at contact frames be consistent with the observed object geometry (from depth or stereo reconstruction), enforcing that fingertips are within 3mm of the object surface. This constraint effectively anchors the pose estimate at the most important moment.
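As a minimal sketch of how such a constraint can enter an optimization-based fitter, the following hinge-style penalty charges fingertips for exceeding the 3mm band around the reconstructed surface. The function and variable names are illustrative; the actual objective may differ:

```python
import numpy as np

TOL_MM = 3.0  # allowed fingertip-to-surface distance at contact frames

def contact_penalty(fingertips_mm: np.ndarray,
                    surface_points_mm: np.ndarray) -> float:
    """Sum of hinge losses on fingertip-to-surface distance.

    fingertips_mm:     (5, 3) estimated fingertip positions
    surface_points_mm: (M, 3) object-surface samples from depth or stereo
    """
    # Distance from each fingertip to its nearest surface sample.
    d = np.linalg.norm(
        fingertips_mm[:, None, :] - surface_points_mm[None, :, :], axis=-1
    ).min(axis=1)
    # Zero penalty inside the 3mm band, linear outside it.
    return float(np.maximum(d - TOL_MM, 0.0).sum())
```

Added to the fitter's objective at flagged contact frames only, a term like this anchors the pose estimate at the moment the reward computation depends on.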

  Strategy               Approach                        Task success   Labeling cost
  Exhaustive precision   every frame at full precision   81%            100%
  Selective refinement   contact frames only             79%            18%
  Fast regressor only    HaMeR, no refinement            51%            3%
  2D keypoints only      MediaPipe, no 3D                34%            1%
The result: selective refinement achieves 79% task success at 18% of the labeling cost of exhaustive precision estimation. The 2-percentage-point gap relative to exhaustive refinement is not statistically significant at our sample sizes. This represents an 82% cost reduction with no meaningful performance loss.

05

Downstream implications for data pipeline design

The practical lesson for robotics data pipelines is that pose estimation investment should be non-uniform across time. Free-space motion — approach and retract trajectories — can be labeled cheaply. The labeling budget should concentrate at contact events, which can be detected automatically with high reliability using force-torque sensors (if available) or via visual contact detectors trained on synthetic contact events.
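When a force-torque signal is available, the flagging step can be as simple as windowing around threshold crossings. This is an illustrative sketch (threshold, window, and names are assumptions, not our production detector):

```python
import numpy as np

def contact_frames(force_n: np.ndarray, fps: int = 30,
                   thresh_n: float = 1.5, window_s: float = 0.5) -> np.ndarray:
    """Boolean mask of frames within window_s of a contact transition."""
    in_contact = force_n > thresh_n
    # Indices where the contact state flips (grasp initiation or release).
    transitions = np.flatnonzero(np.diff(in_contact.astype(int)) != 0)
    mask = np.zeros(len(force_n), dtype=bool)
    w = int(round(window_s * fps))
    for t in transitions:
        mask[max(0, t - w): t + w + 1] = True
    return mask

# Toy episode: free space, grasp, hold, release (force magnitude in newtons).
f = np.concatenate([np.zeros(60), np.full(60, 4.0), np.zeros(60)])
mask = contact_frames(f)
print(mask.mean())  # fraction of frames routed to high-precision refinement
```

Only the masked frames go to the expensive refinement stage; the rest keep the cheap regressor output.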

A secondary implication concerns data collection strategy. If contact events are where quality matters, it is worth designing demonstration collection protocols that slow down at contact — instructing demonstrators to pause briefly at grasp initiation and release, giving the labeling system more frames to work with at the critical moments. This is a protocol change with near-zero cost that meaningfully improves label quality.

The best data collection system is one that knows where in a trajectory to invest its measurement budget. Uniform quality across all frames is a suboptimal allocation of expensive human and computational time.

06

References

  [1] Romero, J., Tzionas, D., Black, M. J. (2017). Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM TOG (SIGGRAPH Asia) 2017.
  [2] Pavlakos, G., et al. (2024). Reconstructing Hands in 3D with Transformers. CVPR 2024.
  [3] Lin, K., Wang, L., Liu, Z. (2021). End-to-End Human Pose and Mesh Reconstruction with Transformers. CVPR 2021.
  [4] Rong, Y., et al. (2021). FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration. ICCV Workshops 2021.
  [5] Zimmermann, C., et al. (2019). FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images. ICCV 2019.
  [6] Yang, L., et al. (2022). OakInk: A Large-Scale Knowledge Repository for Understanding Hand-Object Interaction. CVPR 2022.
  [7] Qian, N., et al. (2020). HTML: A Parametric Hand Texture Model for 3D Hand Reconstruction and Personalization. ECCV 2020.