Research Note
Embodied Data Collection Methodology
Definition
The methodological framework for collecting training data for Physical AI / Embodied AI. The core question: how to acquire high-quality robot manipulation data, especially dexterous hand data, at scale.
Data Pyramid (Traditional Framework)
┌──────────────────┐
│ Real robot data │ ← Highest quality, but expensive/slow/hard to scale
├──────────────────┤
│ Simulation data │ ← High volume, but sim-to-real gap
├──────────────────┤
│ Internet video │ ← Largest volume, but hard to learn from, no tactile
└──────────────────┘
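How the pyramid is consumed is itself a design choice. A minimal sketch, assuming a co-training setup that tags each source with its tier and mixes tiers with hand-set weights (class names and weight ratios are hypothetical, not from the source):

```python
# Sketch: tag data sources by pyramid tier and derive sampling probabilities.
# Tier weights are illustrative assumptions; the right ratios are an open question.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    REAL_ROBOT = "real_robot"          # highest quality, smallest volume
    SIMULATION = "simulation"          # high volume, sim-to-real gap
    INTERNET_VIDEO = "internet_video"  # largest volume, noisiest

@dataclass
class DataSource:
    name: str
    tier: Tier
    num_trajectories: int

TIER_WEIGHT = {Tier.REAL_ROBOT: 1.0, Tier.SIMULATION: 0.3, Tier.INTERNET_VIDEO: 0.05}

def sampling_probs(sources: list[DataSource]) -> list[float]:
    """Probability of drawing from each source: tier utility x raw volume, normalized."""
    raw = [TIER_WEIGHT[s.tier] * s.num_trajectories for s in sources]
    total = sum(raw)
    return [r / total for r in raw]
```

The point the pyramid encodes: raw volume alone should not set a source's weight; tier-level utility has to enter somewhere.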
Core Path Evolution
1. Simulation
- Representatives: OpenAI Dactyl (Rubik's cube), ADR (Automatic Domain Randomization; sketched below)
- Advantage: nearly unlimited data
- Bottleneck: heavy dependence on accurate physics modeling; policies fail on contact patterns the simulator never modeled
- Improvements: ExoStart, Bi-DexHands, Visual Dexterity
- Verdict: simulation alone has a ceiling; dexterous hands must ultimately return to real-world data
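For context on ADR: the core loop is simple. A minimal sketch in the spirit of the Rubik's cube work; thresholds, step sizes, and the friction example are illustrative assumptions, not values from the source:

```python
# Sketch of Automatic Domain Randomization (ADR): each physics parameter has a
# randomization range that widens as the policy succeeds at its boundary.
import random

class ADRParam:
    def __init__(self, lo: float, hi: float, step: float):
        self.lo, self.hi, self.step = lo, hi, step

    def sample(self) -> float:
        return random.uniform(self.lo, self.hi)

def adr_update(param: ADRParam, boundary_success_rate: float,
               expand_at: float = 0.8, shrink_at: float = 0.2) -> None:
    """Widen the range when the policy handles the boundary well; narrow it
    when the boundary is too hard. This is the core ADR feedback loop."""
    if boundary_success_rate >= expand_at:
        param.lo -= param.step
        param.hi += param.step
    elif boundary_success_rate <= shrink_at:
        param.lo = min(param.lo + param.step, param.hi)
        param.hi = max(param.hi - param.step, param.lo)

# Example: friction starts nearly deterministic and widens as training succeeds.
friction = ADRParam(lo=0.95, hi=1.05, step=0.05)
adr_update(friction, boundary_success_rate=0.9)  # range expands to [0.90, 1.10]
```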
2. Teleoperation
- Essence: a human controls the robot in real time while demonstrations are recorded (loop sketched below)
- Three routes: vision (DexPilot→AnyTeleop→HoloDex), glove (Manus+Vive), exoskeleton (HexoTrac/MILE/DOGlove)
- Fundamental limitation: a physical robot must be in the loop, which inherently caps scale-up
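A minimal sketch of the teleoperation recording loop, assuming a hypothetical `robot` interface (`get_observation()`, `command_joints()`) and a hypothetical hand tracker; real vision-based systems like DexPilot replace the linear `retarget` stand-in with fingertip-space optimization:

```python
# Sketch of a teleop episode: read human hand pose, retarget to robot joints,
# log (observation, action) pairs. All interfaces here are assumptions.
import time

def read_human_hand() -> list[float]:
    """Placeholder for a glove/vision tracker returning joint angles (rad)."""
    return [0.0] * 16  # hypothetical 16-DoF human hand reading

def retarget(human_joints: list[float], scale: float = 1.0) -> list[float]:
    # Stand-in for fingertip-space optimization: per-joint linear mapping.
    return [scale * q for q in human_joints]

def teleop_episode(robot, hz: float = 30.0, horizon: int = 300) -> list[dict]:
    demo = []
    for _ in range(horizon):
        obs = robot.get_observation()          # camera images, proprioception
        action = retarget(read_human_hand())   # robot joint targets
        robot.command_joints(action)
        demo.append({"obs": obs, "action": action, "t": time.time()})
        time.sleep(1.0 / hz)
    return demo
```

The fundamental limitation is visible in the code: nothing runs without `robot`.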
3. In-the-wild Body-free Collection (Most Promising)
- Core idea: no robot needed; collect data from natural human activities
- Two-finger grippers are largely solved: handheld, robot-free collection → large-scale gripper datasets
- Dexterous hand exoskeletons come in three forms:
  - Under-hand: isomorphic to the robot hand, low embodiment gap, but tied to a specific end-effector
  - Mid-hand: conforms to the fingers; requires a custom exoskeleton per robot hand
  - Over-hand (an embodied AI startup): pursues generality; 21 joints + 400 tactile points; low cost but high algorithmic difficulty
- Data glove route: DexWild (EMF + ArUco); expensive (~tens of thousands of RMB per glove); example frame schema sketched below
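Whatever the hardware form factor, the output of robot-free collection has roughly the same shape. A sketch of one frame; field names are my assumption, not DexWild's actual schema:

```python
# Sketch of a robot-free, in-the-wild data frame (schema is an assumption).
from dataclasses import dataclass, field

@dataclass
class WildFrame:
    timestamp: float
    wrist_pose: list[float]    # [x, y, z, qx, qy, qz, qw], e.g. from ArUco tracking
    finger_joints: list[float] # glove / exoskeleton joint angles, radians
    tactile: list[float] = field(default_factory=list)  # optional taxel readings
    rgb_path: str = ""         # synchronized egocentric camera frame on disk
```

Note that no robot appears anywhere in the record: retargeting these human-hand frames onto a specific robot hand is deferred entirely to the learning pipeline, which is where the "high algorithm difficulty" lands.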
4. Human Video Data (Future Goldmine)
- Representatives: EgoMimic / EgoDex / EgoScale
- Near-infinite volume, but severe noise: self-occlusion, no tactile signal, embodiment gap (filtering sketch below)
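A minimal sketch of the filtering such video needs before use, assuming per-frame hand-detection confidence scores (thresholds and the example are illustrative): drop occluded frames, then keep only runs long enough to form trajectories.

```python
# Sketch: extract usable high-confidence segments from an egocentric clip.
def usable_segments(conf_per_frame: list[float],
                    min_conf: float = 0.5,
                    min_len: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of contiguous high-confidence runs."""
    segments, start = [], None
    for i, c in enumerate(conf_per_frame):
        if c >= min_conf and start is None:
            start = i
        elif c < min_conf and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(conf_per_frame) - start >= min_len:
        segments.append((start, len(conf_per_frame)))
    return segments

# Example: a 30 fps clip where the hand is self-occluded mid-way.
confs = [0.9] * 900 + [0.1] * 200 + [0.8] * 700
print(usable_segments(confs))  # [(0, 900), (1100, 1800)]
```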
Core Trade-off
Data utility vs. Scale-up potential
Easier to scale → hardware burden ↓ but algorithm burden ↑ (retargeting, denoising, bridging the embodiment gap). This is the fundamental tension in dexterous hand data collection; a toy model follows.
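A toy model of the tension, with all numbers illustrative and not from the source: total value ≈ per-sample utility × attainable volume, and each route trades one factor for the other.

```python
# Toy model of utility vs. scale (hypothetical numbers for illustration only).
routes = {
    # route: (per_sample_utility u, attainable samples N)
    "real_robot_teleop": (1.00, 1e4),
    "in_the_wild_glove": (0.50, 1e6),
    "human_video":       (0.05, 1e9),
}
for name, (u, n) in routes.items():
    print(f"{name:20s} effective value ~ {u * n:.0f}")
```

Which regime wins under real learning algorithms is exactly what the open questions below ask.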
Open Questions (public version)
- How to efficiently “distill” manipulation knowledge from human video to robots?
- Can human hand data beyond current robot capabilities be reused by future higher-DoF hands?
- What format should dexterous hand data use? Industry has no consensus.
- Tactile data standardization: resistive vs. visuotactile vs. other?
- Will world models (e.g., Veo 3) reduce demand for collected data? (team concern)
- Brute-force sim-to-real works for locomotion; can it also work for manipulation?
Sources
- (third-party industry article; not redistributed here)
- (internal team-discussion notes; private)