Research Note
Embodied Data Collection Methodology
Definition
The methodological framework for collecting training data for Physical AI / Embodied AI. The core question: how to acquire high-quality robot manipulation data, especially dexterous hand data, at scale.
Data Pyramid (Traditional Framework)
┌──────────────────┐
│ Real robot data │ ← Highest quality, but expensive/slow/hard to scale
├──────────────────┤
│ Simulation data │ ← High volume, but sim-to-real gap
├──────────────────┤
│ Internet video │ ← Largest volume, but hard to learn from, no tactile
└──────────────────┘
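How the pyramid is consumed is itself a design choice. A minimal sketch, assuming a co-training setup that tags each source with its tier and mixes tiers with hand-set weights (class names and weight ratios are hypothetical, not from the source):

```python
# Sketch: tag data sources by pyramid tier and derive sampling probabilities.
# Tier weights are illustrative assumptions; the right ratios are an open question.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    REAL_ROBOT = "real_robot"          # highest quality, smallest volume
    SIMULATION = "simulation"          # high volume, sim-to-real gap
    INTERNET_VIDEO = "internet_video"  # largest volume, noisiest

@dataclass
class DataSource:
    name: str
    tier: Tier
    num_trajectories: int

TIER_WEIGHT = {Tier.REAL_ROBOT: 1.0, Tier.SIMULATION: 0.3, Tier.INTERNET_VIDEO: 0.05}

def sampling_probs(sources: list[DataSource]) -> list[float]:
    """Probability of drawing from each source: tier utility x raw volume, normalized."""
    raw = [TIER_WEIGHT[s.tier] * s.num_trajectories for s in sources]
    total = sum(raw)
    return [r / total for r in raw]
```

The point the pyramid encodes: raw volume alone should not set a source's weight; tier-level utility has to enter somewhere.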
Core Path Evolution
1. Simulation
- Representatives: OpenAI Dactyl (Rubik's cube), ADR (Automatic Domain Randomization; sketched below)
- Advantage: nearly unlimited data
- Bottleneck: heavy dependence on accurate physics modeling; policies fail on contact patterns the simulator never modeled
- Improvements: ExoStart, Bi-DexHands, Visual Dexterity
- Verdict: simulation alone has a ceiling; dexterous hands must ultimately return to real-world data
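For context on ADR: the core loop is simple. A minimal sketch in the spirit of the Rubik's cube work; thresholds, step sizes, and the friction example are illustrative assumptions, not values from the source:

```python
# Sketch of Automatic Domain Randomization (ADR): each physics parameter has a
# randomization range that widens as the policy succeeds at its boundary.
import random

class ADRParam:
    def __init__(self, lo: float, hi: float, step: float):
        self.lo, self.hi, self.step = lo, hi, step

    def sample(self) -> float:
        return random.uniform(self.lo, self.hi)

def adr_update(param: ADRParam, boundary_success_rate: float,
               expand_at: float = 0.8, shrink_at: float = 0.2) -> None:
    """Widen the range when the policy handles the boundary well; narrow it
    when the boundary is too hard. This is the core ADR feedback loop."""
    if boundary_success_rate >= expand_at:
        param.lo -= param.step
        param.hi += param.step
    elif boundary_success_rate <= shrink_at:
        param.lo = min(param.lo + param.step, param.hi)
        param.hi = max(param.hi - param.step, param.lo)

# Example: friction starts nearly deterministic and widens as training succeeds.
friction = ADRParam(lo=0.95, hi=1.05, step=0.05)
adr_update(friction, boundary_success_rate=0.9)  # range expands to [0.90, 1.10]
```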
2. Teleoperation
- Essence: a human controls the robot in real time while demonstrations are recorded (loop sketched below)
- Three routes: vision (DexPilot→AnyTeleop→HoloDex), glove (Manus+Vive), exoskeleton (HexoTrac/MILE/DOGlove)
- Fundamental limitation: a physical robot must be in the loop, which inherently caps scale-up
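A minimal sketch of the teleoperation recording loop, assuming a hypothetical `robot` interface (`get_observation()`, `command_joints()`) and a hypothetical hand tracker; real vision-based systems like DexPilot replace the linear `retarget` stand-in with fingertip-space optimization:

```python
# Sketch of a teleop episode: read human hand pose, retarget to robot joints,
# log (observation, action) pairs. All interfaces here are assumptions.
import time

def read_human_hand() -> list[float]:
    """Placeholder for a glove/vision tracker returning joint angles (rad)."""
    return [0.0] * 16  # hypothetical 16-DoF human hand reading

def retarget(human_joints: list[float], scale: float = 1.0) -> list[float]:
    # Stand-in for fingertip-space optimization: per-joint linear mapping.
    return [scale * q for q in human_joints]

def teleop_episode(robot, hz: float = 30.0, horizon: int = 300) -> list[dict]:
    demo = []
    for _ in range(horizon):
        obs = robot.get_observation()          # camera images, proprioception
        action = retarget(read_human_hand())   # robot joint targets
        robot.command_joints(action)
        demo.append({"obs": obs, "action": action, "t": time.time()})
        time.sleep(1.0 / hz)
    return demo
```

The fundamental limitation is visible in the code: nothing runs without `robot`.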
3. In-the-wild Body-free Collection (Most Promising)
- Core idea: no robot needed; collect data from natural human activities
- Two-finger grippers are largely solved: handheld, robot-free collection → large-scale gripper datasets
- Dexterous hand exoskeletons come in three forms:
  - Under-hand: isomorphic to the robot hand, low embodiment gap, but tied to a specific end-effector
  - Mid-hand: conforms to the fingers; requires a custom exoskeleton per robot hand
  - Over-hand (an embodied AI startup): pursues generality; 21 joints + 400 tactile points; low cost but high algorithmic difficulty
- Data glove route: DexWild (EMF + ArUco); expensive (~tens of thousands of RMB per glove); example frame schema sketched below
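Whatever the hardware form factor, the output of robot-free collection has roughly the same shape. A sketch of one frame; field names are my assumption, not DexWild's actual schema:

```python
# Sketch of a robot-free, in-the-wild data frame (schema is an assumption).
from dataclasses import dataclass, field

@dataclass
class WildFrame:
    timestamp: float
    wrist_pose: list[float]    # [x, y, z, qx, qy, qz, qw], e.g. from ArUco tracking
    finger_joints: list[float] # glove / exoskeleton joint angles, radians
    tactile: list[float] = field(default_factory=list)  # optional taxel readings
    rgb_path: str = ""         # synchronized egocentric camera frame on disk
```

Note that no robot appears anywhere in the record: retargeting these human-hand frames onto a specific robot hand is deferred entirely to the learning pipeline, which is where the "high algorithm difficulty" lands.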
4. Human Video Data (Future Goldmine)
- Representatives: EgoMimic / EgoDex / EgoScale
- Near-infinite volume, but severe noise: self-occlusion, no tactile signal, embodiment gap (filtering sketch below)
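A minimal sketch of the filtering such video needs before use, assuming per-frame hand-detection confidence scores (thresholds and the example are illustrative): drop occluded frames, then keep only runs long enough to form trajectories.

```python
# Sketch: extract usable high-confidence segments from an egocentric clip.
def usable_segments(conf_per_frame: list[float],
                    min_conf: float = 0.5,
                    min_len: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of contiguous high-confidence runs."""
    segments, start = [], None
    for i, c in enumerate(conf_per_frame):
        if c >= min_conf and start is None:
            start = i
        elif c < min_conf and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(conf_per_frame) - start >= min_len:
        segments.append((start, len(conf_per_frame)))
    return segments

# Example: a 30 fps clip where the hand is self-occluded mid-way.
confs = [0.9] * 900 + [0.1] * 200 + [0.8] * 700
print(usable_segments(confs))  # [(0, 900), (1100, 1800)]
```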
Core Trade-off
Data utility vs. Scale-up potential
Easier to scale → hardware burden ↓ but algorithm burden ↑ (retargeting, denoising, bridging the embodiment gap). This is the fundamental tension in dexterous hand data collection; a toy model follows.
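A toy model of the tension, with all numbers illustrative and not from the source: total value ≈ per-sample utility × attainable volume, and each route trades one factor for the other.

```python
# Toy model of utility vs. scale (hypothetical numbers for illustration only).
routes = {
    # route: (per_sample_utility u, attainable samples N)
    "real_robot_teleop": (1.00, 1e4),
    "in_the_wild_glove": (0.50, 1e6),
    "human_video":       (0.05, 1e9),
}
for name, (u, n) in routes.items():
    print(f"{name:20s} effective value ~ {u * n:.0f}")
```

Which regime wins under real learning algorithms is exactly what the open questions below ask.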
Open Questions (public version)
- How to efficiently “distill” manipulation knowledge from human video to robots?
- Can human hand data beyond current robot capabilities be reused by future higher-DoF hands?
- What format should dexterous hand data use? Industry has no consensus.
- Tactile data standardization: resistive vs. visuotactile vs. other?
- Will world models (e.g., Veo 3) reduce demand for collected data? (team concern)
- Brute-force sim-to-real works for locomotion; can it also work for manipulation?
Sources
- (third-party industry article; not redistributed here)
- (internal team-discussion notes; private)