Research · Engineering · 10 min read

LeRobot v2.0, why we ship it natively, and what we extended

TL;DR

We ship LeRobot v2.0 natively and extended it with force channels, object pose registry, and temporal alignment for multi-clock sensor fusion. Here is what broke at scale and how we fixed it.

01

Why LeRobot, and why standardization matters now

Robot learning dataset formats have historically been an afterthought. Datasets were shared as lab-specific pickle files, proprietary HDF5 schemas, or ad-hoc ROS bag collections. Cross-lab reproducibility was essentially impossible without significant data engineering investment. LeRobot [1] changed this by proposing a format built on Parquet (structured observations), MP4 (video streams), and JSON (episode metadata) — all standard, widely-supported formats with excellent tooling.

We chose LeRobot for a specific reason: interoperability with the Hugging Face ecosystem and, by extension, with the growing number of VLA fine-tuning pipelines that consume it. Producing a proprietary format would have required downstream users to implement a converter before they could use our data — a friction that would meaningfully reduce adoption. Native LeRobot means our data plugs directly into existing pipelines.

The decision was not without cost. LeRobot v2.0’s design assumptions are optimized for single-camera, single-arm, 30Hz demonstrations. Our collection infrastructure operates at 1kHz for force sensing, with multiple synchronized cameras, a 6-axis F/T sensor, IMU, and per-object pose tracking. Fitting this into LeRobot required extensions.

02

The contact-force channel

Force-torque data at 1kHz is the first structural challenge. A 30-second episode produces 30,000 F/T samples across 6 axes — 180,000 floating-point values per episode. At scale, raw F/T storage adds roughly 3× overhead compared to a vision-only episode (depending on camera resolution and frame rate). This is acceptable in principle, but it is wasteful once you notice that F/T signals during manipulation are highly compressible.

During free-space motion, F/T readings are dominated by gravity compensation and inertial loads — predictable, smooth signals that differ primarily in their slowly varying envelopes. During contact events, the signal is step-like and then noisy. Neither regime is well-served by standard floating-point Parquet storage, which achieves at best 2× compression on these signals.

Delta encoding for F/T streams

We implemented a delta-encoding scheme: for each F/T channel, we store the first value as a float32 anchor and subsequent values as int16 differences scaled by a per-episode quantization factor. The quantization factor is chosen to preserve sub-millinewton resolution across the dynamic range of the episode. This achieves 8.2× compression on our benchmark episodes with reconstruction error below the sensor noise floor (0.3mN on our ATI Mini45 sensors).
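
The sketch below illustrates the idea rather than our exact implementation: it quantizes the cumulative signal before differencing, so per-step rounding error never accumulates, then stores the int16 first differences. Function names and the default resolution are illustrative.

Delta-encoding sketch (Python)
import numpy as np

def encode_ft(ft: np.ndarray, resolution: float = 1e-4):
    """Delta-encode one F/T episode (ft: float32 array of shape (T, 6)).

    The first sample is kept as a float32 anchor; the rest are int16 first
    differences of a quantized cumulative signal, so rounding never drifts.
    """
    anchor = ft[0].astype(np.float32)
    # Quantization step: the requested resolution, coarsened only if the largest
    # sample-to-sample change would otherwise overflow int16.
    largest_step = np.abs(np.diff(ft, axis=0)).max()
    scale = np.float32(max(resolution, largest_step / 32766.0))
    counts = np.round((ft - anchor) / scale).astype(np.int64)   # cumulative counts
    deltas = np.diff(counts, axis=0).astype(np.int16)           # what gets stored
    return anchor, deltas, scale

def decode_ft(anchor: np.ndarray, deltas: np.ndarray, scale: np.float32) -> np.ndarray:
    """Invert encode_ft; per-sample reconstruction error is bounded by scale / 2."""
    counts = np.cumsum(deltas.astype(np.int64), axis=0)
    counts = np.concatenate([np.zeros((1, deltas.shape[1]), dtype=np.int64), counts])
    return anchor + counts.astype(np.float32) * scale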

Fig. 1 — Compression ratio by episode type, averaged across 5,000 episodes: free-space motion 11.2×, pick-and-place 8.4×, precision insertion 6.1×, extended contact hold 4.3×. High-contact episodes (e.g., precision assembly) are less compressible than low-contact episodes (e.g., pick-and-place with brief contacts).

The F/T channel is stored as a separate Parquet column group named observation.force_torque with a companion metadata field force_torque.hz specifying the sampling rate. This is the only structural addition to the Parquet schema — everything else is in existing column groups.
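
Reading the channel back is a column selection plus a metadata lookup. A minimal sketch with pandas; the file names and the exact nesting of the metadata key are illustrative, not a spec.

Consuming observation.force_torque (Python)
import json
import pandas as pd

# Illustrative file names; the real layout follows the standard LeRobot episode tree.
table = pd.read_parquet("episode_000001.parquet")
with open("episode_000001_meta.json") as f:
    meta = json.load(f)

ft = table["observation.force_torque"]   # one 6-vector (Fx, Fy, Fz, Tx, Ty, Tz) per sample
ft_hz = meta["force_torque.hz"]          # companion sampling-rate field described above
print(f"{len(ft)} samples at {ft_hz} Hz -> {len(ft) / ft_hz:.1f} s of force data")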

03

The 6-DoF object pose registry

Training data for manipulation tasks benefits enormously from known object poses: policies can condition on pose-relative actions (grasp 5cm above the object centroid, approach at 45°) rather than absolute workspace coordinates. This dramatically improves sample efficiency and generalization to novel object positions.

We track objects using a hybrid approach: ArUco markers [2] are placed on objects for quick initialization, and FoundationPose [3] provides continuous 6-DoF tracking without markers once initialized. Object poses are stored in a separate JSON file per episode (object_poses.json) with the following schema:

object_poses.json schema
{
  "objects": [
    {
      "id": "nut_m6_hex",
      "mesh_asset": "assets/nut_m6_hex.obj",
      "poses": [
        {
          "frame_idx": 0,
          "translation": [x, y, z],   // meters, robot base frame
          "rotation": [qw, qx, qy, qz] // unit quaternion
        }
        // ... one entry per video frame
      ],
      "tracker_confidence": [0.99, 0.98, ...],
      "aruco_initialized": true
    }
  ],
  "coordinate_frame": "robot_base",
  "units": "meters"
}
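
For illustration, here is how a consumer might turn these per-frame poses into the pose-relative targets mentioned above (e.g., a grasp point 5cm above the object origin). Field names follow the schema; the helper and file handling are hypothetical.

Pose-relative target sketch (Python)
import json
import numpy as np

def quat_to_mat(qw, qx, qy, qz):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])

with open("object_poses.json") as f:
    registry = json.load(f)

obj = registry["objects"][0]              # e.g. "nut_m6_hex"
pose = obj["poses"][0]                    # pose at frame_idx 0
t = np.array(pose["translation"])         # meters, robot base frame
R = quat_to_mat(*pose["rotation"])        # [qw, qx, qy, qz]

# A grasp target 5cm along the object's z-axis, expressed in the robot base frame:
grasp_in_object = np.array([0.0, 0.0, 0.05])
grasp_in_base = t + R @ grasp_in_object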

Storing poses at video frame rate (30Hz) rather than F/T rate (1kHz) reflects a deliberate decision: object geometry changes slowly relative to force dynamics, and FoundationPose at 1kHz would require GPU resources we cannot afford per-episode at scale. The 30Hz rate is sufficient for the downstream applications we target.

04

Soft temporal alignment across sensor clocks

The hardest engineering problem in multi-sensor robot data is clock synchronization. Our sensor stack includes: (1) two RGB cameras timestamped by their internal clock, (2) an F/T sensor with a separate hardware clock, (3) robot joint state from the controller at 500Hz, and (4) IMU at 200Hz. These clocks drift relative to each other at rates between 1ppm (for disciplined NTP sources) and 50ppm (for undisciplined embedded clocks). Over a 30-second episode, 50ppm drift is 1.5ms — which at 1kHz F/T sampling is 1.5 samples. This is noticeable.

Hardware PTP (Precision Time Protocol) would solve this at the cost of adding PTP-capable hardware to every sensor in our stack. We chose a software approach: we record a brief synchronization pulse at the start of each episode (a wrist tap that is simultaneously visible in the cameras, detectable in F/T, and logged by the robot controller), fit a low-order polynomial drift model to align sensor clocks (see the implementation note below), and store the drift coefficients per episode for use during loading.

Implementation note

The sync pulse approach costs one second of episode time and requires no additional hardware. For episodes longer than two minutes, we add a second sync pulse midway. Clock drift is modeled as linear over 30-second windows; quadratic over longer durations.
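
Once the sync pulse has been detected in each stream, fitting is a small least-squares problem. A sketch under the assumptions above: two pulses constrain a linear model, and a quadratic term can only be fit when more reference points are available. Function names and the example times are illustrative.

Drift-model fitting sketch (Python)
import numpy as np

def fit_drift(t_ref: np.ndarray, t_sensor: np.ndarray, order: int = 1) -> np.ndarray:
    """Fit t_ref ≈ c_n * t_sensor**n + ... + c_0 from matched sync-pulse times.

    t_ref:    pulse times on the reference clock (e.g. the robot controller), seconds
    t_sensor: the same pulses as seen on the sensor's own clock, seconds
    """
    return np.polyfit(t_sensor, t_ref, deg=order)

# Example: one pulse at episode start and one midway (a >2-minute episode).
coeffs = fit_drift(t_ref=np.array([0.002, 61.050]),
                   t_sensor=np.array([0.000, 61.047]))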

The temporal alignment coefficients are stored in the episode metadata JSON as sync.drift_coefficients, one set per sensor pair. The LeRobot dataloader is patched to apply these coefficients during sample construction, so downstream users see a temporally aligned dataset without needing to implement alignment themselves.
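
Conceptually, the patched loading step looks like the sketch below: map each sensor's raw timestamps onto the reference clock using the stored coefficients, then gather the samples that fall between consecutive video frames. The metadata key follows the text above; the helper name and window logic are illustrative.

Applying drift coefficients during loading (Python)
import numpy as np

def align_ft_to_frames(ft_t_raw, ft_values, frame_t_ref, drift_coeffs):
    """For each video frame, return the F/T samples recorded since the previous frame.

    ft_t_raw:     raw F/T timestamps on the sensor clock, seconds
    ft_values:    (T, 6) array of F/T samples
    frame_t_ref:  video frame timestamps on the reference clock, seconds
    drift_coeffs: the sync.drift_coefficients entry for the (F/T, reference) pair
    """
    ft_t_ref = np.polyval(drift_coeffs, ft_t_raw)   # sensor clock -> reference clock
    chunks, prev = [], -np.inf
    for t in frame_t_ref:
        mask = (ft_t_ref > prev) & (ft_t_ref <= t)
        chunks.append(ft_values[mask])              # roughly 1kHz / 30Hz ≈ 33 samples
        prev = t
    return chunks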

05

What we deliberately did not extend

Several obvious extensions were considered and rejected. Audio: wrist-mounted microphones can detect contact events (the acoustic signature of fingertip contact is distinctive), but the signal is heavily lab-environment-dependent and we have not validated audio-conditioned policies. We collect it but do not include it in the standard export.

Depth video: our cameras include depth streams at 15Hz. We chose not to include these by default because (1) the F/T data serves the same downstream purpose (contact detection) more reliably, and (2) depth video storage costs are substantial. We offer depth export as an opt-in flag.

In-hand tactile: we instrument a subset of gripper fingers with DIGIT [4] tactile sensors, but these produce image streams (60fps, 320×240) that add 2.5× storage overhead. Tactile streams are stored in a separate LeRobot-compatible sidecar dataset and linked via episode ID, rather than in the main dataset.

The design philosophy throughout has been: add to the standard format what every downstream user needs by default; make expensive-to-store modalities opt-in or linked sidecars. LeRobot’s value is its interoperability — every extension that diverges from the standard format erodes that value. We tried to minimize divergence while enabling the contact-aware training we need.

06

References

[1] Cadène, R. et al. (2024). LeRobot: State-of-the-art machine learning for real-world robotics. Hugging Face, 2024.
[2] Garrido-Jurado, S. et al. (2014). Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6), 2014.
[3] Wen, B. et al. (2024). FoundationPose: Unified 6D pose estimation and tracking of novel objects. CVPR 2024.
[4] Lambeta, M. et al. (2020). DIGIT: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation. RA-L 2020.