HUMAN DEMONSTRATION DATA
Human POV & Motion Capture
First-person human video paired with full-body 3D motion capture - the human-demonstration layer for embodied pretraining.
See the data
Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.
Third-person body track, ball toss (33 landmarks)
Representative of the real pose file. pose_data holds 3 coordinates per landmark in world meters.
{
"metadata": { "num_frames": 39, "sample_rate": 5, "num_landmarks": 33, "coords": "world", "units": "meters" },
"connections": [[0,1],[1,2],[2,3],[3,7],[0,4]],
"pose_data": [ [ [-0.0349,-0.6541,-0.2357], [-0.0195,-0.6900,-0.2276] ] ]
}
Clip manifest, POV egocentric
Representative. A first-person (POV) clip with a MediaPipe 33-landmark body pose.
{
"episode": { "id": "pov_20251112_233840", "type": "pov", "prompt": "Configuring a home router", "duration_sec": 8.0 },
"video": { "raw": "raw/video.mp4", "codec": "h264", "fps": 24, "resolution": [1280, 720], "frame_count": 192 },
"pose_3d": { "method": "mediapipe", "keypoints_uri": "raw/body_keypoints_3d.npz",
"num_landmarks": 33, "num_frames": 192, "coords": "world", "units": "meters" }
}
First-person hand keypoints (21 landmarks per hand)
Representative of the real packed-array shape; 21 landmarks per hand x 3 coordinates, with per-frame visibility.
left_hand: float64 [1892, 21, 3] # XYZ world meters
right_hand: float64 [1892, 21, 3]
left_visibility: float64 [1892] # per-frame confidence
right_visibility: float64 [1892]
landmark_names: [21] (WRIST, THUMB_TIP, INDEX_TIP, ...)
connections: int [23, 2] # hand skeleton edges
fps: int
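A minimal sketch of how arrays in this shape might be checked and indexed by time with NumPy. The arrays below are synthetic stand-ins, the `fps` value of 24 is an assumption borrowed from the clip manifest, and `check_hand_arrays` and `frame_times` are hypothetical helper names rather than part of any delivered tooling.

```python
import numpy as np

NUM_FRAMES, NUM_HAND_LANDMARKS = 1892, 21  # shape taken from the listing above

# Synthetic stand-ins for the packed arrays; real values come from the NPZ file.
hands = {
    "left_hand": np.zeros((NUM_FRAMES, NUM_HAND_LANDMARKS, 3), dtype=np.float64),
    "right_hand": np.zeros((NUM_FRAMES, NUM_HAND_LANDMARKS, 3), dtype=np.float64),
    "left_visibility": np.ones(NUM_FRAMES, dtype=np.float64),
    "right_visibility": np.ones(NUM_FRAMES, dtype=np.float64),
    "fps": 24,  # assumed value; the listing only declares the type
}


def check_hand_arrays(data: dict) -> None:
    """Raise ValueError if either hand breaks the [F, 21, 3] / [F] layout."""
    for side in ("left", "right"):
        xyz = data[f"{side}_hand"]
        vis = data[f"{side}_visibility"]
        if xyz.ndim != 3 or xyz.shape[1:] != (NUM_HAND_LANDMARKS, 3):
            raise ValueError(f"{side}_hand has unexpected shape {xyz.shape}")
        if vis.shape != (xyz.shape[0],):
            raise ValueError(f"{side}_visibility does not match the frame count")


def frame_times(num_frames: int, fps: int) -> np.ndarray:
    """Timestamp in seconds for each frame index."""
    return np.arange(num_frames, dtype=np.float64) / fps


check_hand_arrays(hands)
times = frame_times(NUM_FRAMES, hands["fps"])
```

Deriving timestamps from `fps` rather than storing them keeps the packed file small, at the cost of assuming a constant frame rate.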
Record shape
Every field, its type, whether it can be null, and a representative value.
| Field | Type | Constraint | Description |
|---|---|---|---|
| episode.id | string | required | Clip identifier. e.g. syn_third_20251119_051421 |
| episode.type | string | required | Egocentric (POV) vs external (third-person). e.g. pov |
| episode.prompt | string | nullable | Task or activity label. e.g. Configuring a home router |
| episode.duration_sec | float · s | required | Clip duration. e.g. 8.0 |
| video.codec | string | required | Video codec. e.g. h264 |
| video.fps | int · fps | required | Frame rate. e.g. 24 |
| video.resolution | [w,h] · px | required | Frame size. e.g. [1280, 720] |
| video.frame_count | int · frames | required | Total frames. e.g. 192 |
| pose_3d.method | string | required | Pose estimator. e.g. mediapipe |
| pose_3d.num_landmarks | int | required | Landmark count (33 for the body, 21 per hand). e.g. 33 |
| pose_3d.coords | string | required | Coordinate frame. e.g. world |
| pose_3d.units | string | required | Spatial units. e.g. meters |
| pose_data | float[F][L][3] · m | required | Per-frame, per-landmark XYZ world coordinates. e.g. [[[-0.0349,-0.6541,-0.2357], ...]] |
| connections | int[][2] | required | Skeleton edge list (landmark index pairs). e.g. [[0,1],[1,2],[2,3]] |
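As an illustration of how the fields in this table relate to each other, the sketch below runs a few consistency checks on the representative clip manifest. The function name `manifest_problems`, the one-frame tolerance, and the rule that the pose frame count equals the video frame count are assumptions drawn from the example values (both are 192), not a documented contract.

```python
import json

# The representative manifest from this page, reproduced as a string.
MANIFEST_JSON = """
{
  "episode": {"id": "pov_20251112_233840", "type": "pov",
              "prompt": "Configuring a home router", "duration_sec": 8.0},
  "video": {"raw": "raw/video.mp4", "codec": "h264", "fps": 24,
            "resolution": [1280, 720], "frame_count": 192},
  "pose_3d": {"method": "mediapipe", "keypoints_uri": "raw/body_keypoints_3d.npz",
              "num_landmarks": 33, "num_frames": 192,
              "coords": "world", "units": "meters"}
}
"""


def manifest_problems(manifest: dict, tolerance_frames: int = 1) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    episode, video, pose = manifest["episode"], manifest["video"], manifest["pose_3d"]

    # Required episode fields per the table (episode.prompt is nullable).
    for key in ("id", "type", "duration_sec"):
        if episode.get(key) is None:
            problems.append(f"episode.{key} is required")

    # duration_sec * fps should land close to frame_count.
    duration = episode.get("duration_sec")
    if duration is not None:
        expected = round(duration * video["fps"])
        if abs(expected - video["frame_count"]) > tolerance_frames:
            problems.append(
                f"frame_count {video['frame_count']} does not match "
                f"duration_sec * fps = {expected}"
            )

    # Assumed rule: one pose sample per video frame.
    if pose["num_frames"] != video["frame_count"]:
        problems.append("pose_3d.num_frames differs from video.frame_count")

    return problems


problems = manifest_problems(json.loads(MANIFEST_JSON))
```

For the representative manifest, 8.0 s at 24 fps gives 192 frames, so the list comes back empty.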
First-Person POV
Egocentric human video of real tasks - the visual prior for embodied perception and affordance learning.
Full-Body Motion Capture
Rokoko-suit 3D skeletal tracking with joint-level body landmarks, time-synced to video.
Multi-View Sync
First-person and external viewpoints aligned to the same timeline for cross-view learning.
How it is built
- 01
Egocentric and external capture
Human demonstrations are captured as first-person (POV) and external (third-person) video of real tasks in H.264.
- 02
Pose estimation
MediaPipe extracts 3D skeletal keypoints: 33 body landmarks for the full body and 21 landmarks per hand for fine manipulation, in world coordinates in meters.
- 03
Skeleton structuring
Keypoints are stored as per-frame XYZ arrays with an explicit connection edge list defining the skeleton graph, plus per-landmark visibility.
- 04
AI enrichment
A vision-language model auto-generates the task name, description, objects present, and actions, with timestamped narration for language-conditioned policies.
- 05
Preview and packaging
Each clip is rendered to a preview GIF with tracking overlay and an annotated pose frame, bundled with the raw video, pose JSON, packed keypoints, and a manifest.
- 06
Public sample surfacing
Real human-POV clips are kept as their own line, served from storage object URLs, separate from any synthetic or generated POV.
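To illustrate the skeleton structuring step, here is a small sketch of one thing the connection edge list makes possible: measuring the length of each skeleton edge in a frame. The helper name `bone_lengths` is hypothetical, and the single edge used in the example is for illustration only.

```python
import math


def bone_lengths(frame, connections):
    """Length in meters of each skeleton edge for one frame of XYZ landmarks."""
    return [math.dist(frame[a], frame[b]) for a, b in connections]


# The two landmarks shown in the representative pose record, joined by one edge.
frame = [[-0.0349, -0.6541, -0.2357], [-0.0195, -0.6900, -0.2276]]
connections = [[0, 1]]

lengths = bone_lengths(frame, connections)  # roughly 0.04 m for this edge
```

Because coordinates are in world meters, the result is directly a physical distance, which is one way edge lengths could be compared across frames when looking for tracking glitches.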
How we validate
What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.
Pose tracking confidence
Measures
Whether extracted skeletal keypoints are reliable per frame.
Method
Per-landmark visibility scores are stored alongside keypoints and tracked as a QA metric.
Result
Methodology-stage. Per-frame visibility is recorded; no published aggregate accuracy.
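A hedged sketch of how per-frame visibility could be turned into a simple QA number. The 0.5 threshold, the helper names, and the sample scores are all assumptions made for illustration; no threshold is published for this dataset.

```python
def low_confidence_frames(visibility, threshold=0.5):
    """Indices of frames whose visibility score falls below the threshold."""
    return [i for i, score in enumerate(visibility) if score < threshold]


def visible_fraction(visibility, threshold=0.5):
    """Share of frames at or above the threshold, between 0.0 and 1.0."""
    if len(visibility) == 0:
        raise ValueError("visibility must contain at least one frame")
    flagged = low_confidence_frames(visibility, threshold)
    return 1.0 - len(flagged) / len(visibility)


# Made-up scores for five frames.
scores = [0.98, 0.91, 0.32, 0.87, 0.10]
flagged = low_confidence_frames(scores)
fraction = visible_fraction(scores)
```

Both helpers accept any sequence of floats, so a one-dimensional NumPy array such as `left_visibility` can be passed in directly.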
Skeleton structural validity
Measures
Whether keypoint arrays conform to the expected landmark count and connection graph.
Method
Validate the frame x landmark x 3 array shape against the declared landmark count (33 body, 21 hand) and the connection edge list.
Result
Methodology-stage. Shape and consistency check.
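The check described above can be sketched as follows, assuming a pose record laid out like the representative JSON (`metadata`, `connections`, `pose_data`). The function name `validate_pose_record` is hypothetical, and the example record is reduced to three landmarks so it fits on the page.

```python
def validate_pose_record(record: dict) -> list:
    """Return a list of structural errors; an empty list means the record is valid."""
    errors = []
    meta = record["metadata"]
    n_landmarks = meta["num_landmarks"]
    frames = record["pose_data"]

    # Frame count must match the declared num_frames.
    if len(frames) != meta["num_frames"]:
        errors.append(f"expected {meta['num_frames']} frames, found {len(frames)}")

    # Every frame must hold n_landmarks points of exactly 3 coordinates.
    for f, frame in enumerate(frames):
        if len(frame) != n_landmarks:
            errors.append(f"frame {f}: expected {n_landmarks} landmarks, found {len(frame)}")
            continue
        for i, point in enumerate(frame):
            if len(point) != 3:
                errors.append(f"frame {f}, landmark {i}: expected 3 coordinates")

    # Every edge must be a pair of in-range landmark indices.
    for edge in record["connections"]:
        if len(edge) != 2:
            errors.append(f"edge {edge}: expected a pair of indices")
            continue
        a, b = edge
        if not (0 <= a < n_landmarks and 0 <= b < n_landmarks):
            errors.append(f"edge {edge}: index out of range for {n_landmarks} landmarks")

    return errors


# Reduced synthetic record: 2 frames, 3 landmarks.
good = {
    "metadata": {"num_frames": 2, "num_landmarks": 3},
    "connections": [[0, 1], [1, 2]],
    "pose_data": [
        [[0.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.2, 0.0]],
        [[0.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.2, 0.0]],
    ],
}
errors = validate_pose_record(good)
```

Collecting errors in a list instead of raising on the first one lets a QA report show every problem in a clip at once.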
AI annotation QA
Measures
Whether the auto-generated task, object, and narration labels are correct.
Method
Generated metadata is reviewed in admin QA before customer visibility.
Result
Methodology-stage. Human review gate; no scored metric.
Ground truth
What correct means for this data, and how it is established.
Ground truth
The human's actual demonstrated motion is the ground truth, and the MediaPipe 3D skeleton plus 21-landmark-per-hand track is its structured representation. Task and object labels are graded against human-reviewed annotation.
How it is established
Per-frame visibility scores flag low-confidence keypoints, admin human QA reviews enrichment outputs, and structural validity is checked against the declared landmark and connection schema.
Agreement
No inter-rater or keypoint-accuracy agreement statistic is published.
Foundation Model Pretraining
Large-scale egocentric video and motion for visual and spatial pretraining of embodied models.
Human-to-Robot Transfer
3D human motion mapped toward robot morphologies for retargeting and demonstration transfer.
Human-Robot Interaction
Multi-perspective human behavior with pose annotations for interaction and intent modeling.
How you load it
Delivery
S3, REST API, CDN, Public preview URLs
Formats
H.264 MP4, JSON pose, NPZ packed keypoints, GIF / PNG previews, JSON manifest
Auth
The public sample tier is open via preview URLs; the full dataset is behind org-scoped API keys with tenant isolation.
Cadence
Continuous collection with batch delivery; 26 accepted human-POV previews are currently public.
# Load a human-POV clip and its 3D skeleton
# 1. Resolve a sample's urls (video, pose_data, keypoints_3d, preview_gif)
# 2. Download video.mp4 and load body_keypoints_3d.npz with NumPy
# 3. Read pose_data [F, 33, 3] + connections to render the skeleton
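The NumPy loading step might look like the sketch below. It covers only reading a local NPZ file; resolving and downloading URLs is left out because the API is not documented here. The array key `pose_data` inside the NPZ and the helper name `load_keypoints` are assumptions, and the demo writes a synthetic file so the example runs on its own.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_keypoints(npz_path, key):
    """Load one array from a packed-keypoints NPZ file and check its [F, L, 3] layout."""
    with np.load(npz_path) as archive:
        if key not in archive.files:
            raise KeyError(f"{key!r} not found; available arrays: {archive.files}")
        array = archive[key]
    if array.ndim != 3 or array.shape[2] != 3:
        raise ValueError(f"expected shape [F, L, 3], got {array.shape}")
    return array


# Demo with a synthetic file; a real clip would supply body_keypoints_3d.npz.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "body_keypoints_3d.npz"
    np.savez(path, pose_data=np.zeros((192, 33, 3)))  # assumed key name
    pose = load_keypoints(path, "pose_data")

num_frames, num_landmarks, _ = pose.shape
```

If the key name differs in a delivered file, the `KeyError` message lists the arrays that are actually present.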
Request access.
Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.