HUMAN DEMONSTRATION DATA

Human POV & Motion Capture

First-person human video paired with full-body 3D motion capture - the human-demonstration layer for embodied pretraining.

3D SKELETAL
H.264 · BVH · FBX · JSONL
Continuous collection, batch delivery
Body landmarks: 33
Hand landmarks: 21/hand
World-meter skeleton: 3D
Pose estimator: MediaPipe
01 Sample

See the data

Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.

Third-person body track, ball toss (33 landmarks)

Representative of the real pose file. pose_data holds 3 coordinates per landmark in world meters.

pose_data.json (representative)
{
  "metadata": { "num_frames": 39, "sample_rate": 5, "num_landmarks": 33, "coords": "world", "units": "meters" },
  "connections": [[0,1],[1,2],[2,3],[3,7],[0,4]],
  "pose_data": [ [ [-0.0349,-0.6541,-0.2357], [-0.0195,-0.6900,-0.2276] ] ]
}
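
As a rough illustration of how pose_data and connections fit together, the sketch below parses the representative record and measures each skeleton edge in meters. The helper name bone_lengths is hypothetical, and because the sample lists only two of the 33 landmarks, edges that point at missing landmarks are skipped rather than treated as errors.

```python
import json
import math

# Representative record copied from the sample above (landmark list truncated to two entries).
record = json.loads("""
{
  "metadata": { "num_frames": 39, "sample_rate": 5, "num_landmarks": 33, "coords": "world", "units": "meters" },
  "connections": [[0,1],[1,2],[2,3],[3,7],[0,4]],
  "pose_data": [ [ [-0.0349,-0.6541,-0.2357], [-0.0195,-0.6900,-0.2276] ] ]
}
""")


def bone_lengths(frame, connections):
    """Return {(a, b): length in meters} for edges whose endpoints exist in this frame."""
    lengths = {}
    for a, b in connections:
        # Hypothetical leniency: skip edges the truncated sample cannot resolve.
        if a < len(frame) and b < len(frame):
            lengths[(a, b)] = math.dist(frame[a], frame[b])
    return lengths


first_frame = record["pose_data"][0]
lengths = bone_lengths(first_frame, record["connections"])
for edge, meters in lengths.items():
    print(edge, round(meters, 4))
```

On a full 33-landmark frame every edge would resolve, and per-edge lengths that stay roughly constant across frames are one informal sign that the track is stable.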

Clip manifest, POV egocentric

Representative. First-person type with a MediaPipe 33-landmark body pose.

manifest.json (representative)
{
  "episode": { "id": "pov_20251112_233840", "type": "pov", "prompt": "Configuring a home router", "duration_sec": 8.0 },
  "video": { "raw": "raw/video.mp4", "codec": "h264", "fps": 24, "resolution": [1280, 720], "frame_count": 192 },
  "pose_3d": { "method": "mediapipe", "keypoints_uri": "raw/body_keypoints_3d.npz",
               "num_landmarks": 33, "num_frames": 192, "coords": "world", "units": "meters" }
}
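
The manifest fields cross-check one another: in the sample, 8.0 s at 24 fps gives 192 frames, which matches both video.frame_count and pose_3d.num_frames. The sketch below tests those two relations. They are inferred from this one sample rather than from a documented contract, and manifest_problems and the one-frame tolerance are hypothetical.

```python
# Representative manifest copied from the sample above.
manifest = {
    "episode": {"id": "pov_20251112_233840", "type": "pov",
                "prompt": "Configuring a home router", "duration_sec": 8.0},
    "video": {"raw": "raw/video.mp4", "codec": "h264", "fps": 24,
              "resolution": [1280, 720], "frame_count": 192},
    "pose_3d": {"method": "mediapipe", "keypoints_uri": "raw/body_keypoints_3d.npz",
                "num_landmarks": 33, "num_frames": 192, "coords": "world", "units": "meters"},
}


def manifest_problems(m, tolerance_frames=1):
    """Return a list of inconsistencies; an empty list means both checks passed."""
    problems = []
    episode, video, pose = m["episode"], m["video"], m["pose_3d"]

    # Assumed relation: frame_count is duration * fps, within a small rounding tolerance.
    expected = round(episode["duration_sec"] * video["fps"])
    if abs(expected - video["frame_count"]) > tolerance_frames:
        problems.append(
            f"video.frame_count={video['frame_count']} but duration*fps={expected}")

    # Assumed relation: one pose sample per video frame.
    if pose["num_frames"] != video["frame_count"]:
        problems.append(
            f"pose_3d.num_frames={pose['num_frames']} != video.frame_count={video['frame_count']}")
    return problems


problems = manifest_problems(manifest)
print(problems or "manifest is internally consistent")
```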

First-person hand keypoints (21 landmarks per hand)

Representative of the real packed-array shape; 21 landmarks per hand x 3 coordinates, with per-frame visibility.

hand_keypoints_3d.npz (representative)
left_hand:        float64 [1892, 21, 3]   # XYZ world meters
right_hand:       float64 [1892, 21, 3]
left_visibility:  float64 [1892]          # per-frame confidence
right_visibility: float64 [1892]
landmark_names:   [21]   (WRIST, THUMB_TIP, INDEX_TIP, ...)
connections:      int [23, 2]             # hand skeleton edges
fps:              int
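
One way to consume the packed arrays is to mask out frames whose per-frame visibility is low before using the coordinates. The sketch below builds a small synthetic archive with the keys listed above, so it runs without a real file, and then applies a cutoff. The 0.5 threshold and the four-frame stand-in data are assumptions, not delivered values.

```python
import io

import numpy as np

# Synthetic stand-in for hand_keypoints_3d.npz: same keys, only 4 frames of random data.
rng = np.random.default_rng(0)
num_frames = 4
buffer = io.BytesIO()
np.savez(
    buffer,
    left_hand=rng.normal(size=(num_frames, 21, 3)),
    right_hand=rng.normal(size=(num_frames, 21, 3)),
    left_visibility=np.array([0.9, 0.2, 0.8, 0.4]),
    right_visibility=np.array([0.1, 0.95, 0.7, 0.6]),
    fps=np.array(24),
)
buffer.seek(0)

MIN_VISIBILITY = 0.5  # assumed cutoff; tune per use case

with np.load(buffer) as archive:
    left = archive["left_hand"]                      # [F, 21, 3], world meters
    left_ok = archive["left_visibility"] >= MIN_VISIBILITY
    fps = int(archive["fps"])

# Replace low-confidence frames with NaN so downstream code cannot use them silently.
left_masked = np.where(left_ok[:, None, None], left, np.nan)
kept_frames = np.flatnonzero(left_ok)
print(left_masked.shape, kept_frames.tolist(), fps)
```

For a real file, passing its path to np.load in place of the in-memory buffer should be the only change needed, assuming the key names match the listing.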
02 Schema

Record shape

Every field, its type, whether it can be null, and a representative value.

Field                   Type                   Constraint  Description                                      Example
episode.id              string                 required    Clip identifier.                                 syn_third_20251119_051421
episode.type            string                 required    Egocentric (POV) vs external (third-person).     pov
episode.prompt          string                 nullable    Task or activity label.                          Configuring a home router
episode.duration_sec    float · s              required    Clip duration.                                   8.0
video.codec             string                 required    Video codec.                                     h264
video.fps               int · fps              required    Frame rate.                                      24
video.resolution        [w,h] · px             required    Frame size.                                      [1280, 720]
video.frame_count       int · frames           required    Total frames.                                    192
pose_3d.method          string                 required    Pose estimator.                                  mediapipe
pose_3d.num_landmarks   int                    required    Body landmark count (33 body, 21 per hand).      33
pose_3d.coords          string                 required    Coordinate frame.                                world
pose_3d.units           string                 required    Spatial units.                                   meters
pose_data               float[F][L][3] · m     required    Per-frame, per-landmark XYZ world coordinates.   [[[-0.0349,-0.6541,-0.2357], ...]]
connections             int[][2]               required    Skeleton edge list (landmark index pairs).       [[0,1],[1,2],[2,3]]
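
The required and nullable constraints in the field list above can be checked mechanically. The sketch below walks dotted field paths through a manifest-shaped dictionary and reports required fields that are absent or null. The helper names are hypothetical, only the manifest-level fields are covered, and treating a nullable field as optional is an assumption.

```python
# Required manifest fields taken from the field list above (dotted paths).
REQUIRED_FIELDS = [
    "episode.id", "episode.type", "episode.duration_sec",
    "video.codec", "video.fps", "video.resolution", "video.frame_count",
    "pose_3d.method", "pose_3d.num_landmarks", "pose_3d.coords", "pose_3d.units",
]
# episode.prompt is nullable, so it is deliberately not in REQUIRED_FIELDS.


def lookup(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_required(record):
    """Return the required paths that are absent or explicitly null."""
    return [path for path in REQUIRED_FIELDS if lookup(record, path) is None]


example = {
    "episode": {"id": "pov_20251112_233840", "type": "pov",
                "prompt": None, "duration_sec": 8.0},
    "video": {"codec": "h264", "fps": 24, "resolution": [1280, 720], "frame_count": 192},
    "pose_3d": {"method": "mediapipe", "num_landmarks": 33,
                "coords": "world", "units": "meters"},
}
print(missing_required(example))
```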
03 What's included

First-Person POV

Egocentric human video of real tasks - the visual prior for embodied perception and affordance learning.

Full-Body Motion Capture

Rokoko-suit 3D skeletal tracking with joint-level body landmarks, time-synced to video.

Multi-View Sync

First-person and external viewpoints aligned to the same timeline for cross-view learning.

04 Methodology

How it is built

  1. Egocentric and external capture

    Human demonstrations are captured as first-person (POV) and external (third-person) video of real tasks in H.264.

  2. Pose estimation

    MediaPipe extracts 3D skeletal keypoints: 33 body landmarks for the full body and 21 landmarks per hand for fine manipulation, in world coordinates in meters.

  3. Skeleton structuring

    Keypoints are stored as per-frame XYZ arrays with an explicit connection edge list defining the skeleton graph, plus per-landmark visibility.

  4. AI enrichment

    A vision-language model auto-generates the task name, description, objects present, and actions, with timestamped narration for language-conditioned policies.

  5. Preview and packaging

    Each clip is rendered to a preview GIF with tracking overlay and an annotated pose frame, bundled with the raw video, pose JSON, packed keypoints, and a manifest.

  6. Public sample surfacing

    Real human-POV clips are kept as their own line, served from storage object URLs, separate from any synthetic or generated POV.

05 Evals

How we validate

What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.

Pose tracking confidence

Measures

Whether extracted skeletal keypoints are reliable per frame.

Method

Per-landmark visibility scores are stored alongside keypoints and tracked as a QA metric.

Result

Methodology-stage. Per-frame visibility is recorded; no published aggregate accuracy.

Skeleton structural validity

Measures

Whether keypoint arrays conform to the expected landmark count and connection graph.

Method

Validate the frame x landmark x 3 array shape against the declared landmark count (33 body, 21 hand) and the connection edge list.

Result

Methodology-stage. Shape and consistency check.
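
A shape and consistency check of this kind might look like the sketch below. It is an illustration under stated assumptions, not the production validator: structural_errors is a hypothetical name, and the inputs are plain nested lists as in the pose JSON.

```python
def structural_errors(pose_data, connections, num_landmarks):
    """Check a frame x landmark x 3 array and its edge list; return a list of error strings."""
    errors = []
    for f, frame in enumerate(pose_data):
        if len(frame) != num_landmarks:
            errors.append(f"frame {f}: {len(frame)} landmarks, expected {num_landmarks}")
            continue
        for i, point in enumerate(frame):
            if len(point) != 3:
                errors.append(f"frame {f}, landmark {i}: {len(point)} coordinates, expected 3")
    for a, b in connections:
        # Every edge must reference two valid landmark indices.
        if not (0 <= a < num_landmarks and 0 <= b < num_landmarks):
            errors.append(f"edge [{a}, {b}] is outside 0..{num_landmarks - 1}")
    return errors


# Tiny 3-landmark example so the check is easy to follow; real body tracks declare 33.
good_frames = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]] * 2
good_edges = [[0, 1], [1, 2]]
print(structural_errors(good_frames, good_edges, num_landmarks=3))

bad_frames = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]  # one landmark short
bad_edges = [[0, 5]]                               # index 5 does not exist
print(structural_errors(bad_frames, bad_edges, num_landmarks=3))
```

The same function would be called with num_landmarks=33 for a body track or 21 for a single hand.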

AI annotation QA

Measures

Whether the auto-generated task, object, and narration labels are correct.

Method

Generated metadata is reviewed in admin QA before customer visibility.

Result

Methodology-stage. Human review gate; no scored metric.

06 Graders

Ground truth

What correct means for this data, and how it is established.

Ground truth

The motion the human actually demonstrated is the ground truth, and the MediaPipe 3D skeleton plus the 21-landmark-per-hand track is its structured representation. Task and object labels are graded against human-reviewed annotation.

How it is established

Per-frame visibility scores flag low-confidence keypoints, admin human QA reviews enrichment outputs, and structural validity is checked against the declared landmark and connection schema.

Agreement

No inter-rater or keypoint-accuracy agreement statistic is published.

07 Application

Foundation Model Pretraining

Large-scale egocentric video and motion for visual and spatial pretraining of embodied models.

Human-to-Robot Transfer

3D human motion mapped toward robot morphologies for retargeting and demonstration transfer.

Human-Robot Interaction

Multi-perspective human behavior with pose annotations for interaction and intent modeling.

08 Environment & integration

How you load it

Delivery

S3, REST API, CDN, Public preview URLs

Formats

H.264 MP4, JSON pose, NPZ packed keypoints, GIF / PNG previews, JSON manifest

Auth

The public sample tier is open via preview URLs; the full dataset is behind org-scoped API keys with tenant isolation.

Cadence

Continuous collection with batch delivery; 26 accepted human-POV previews are currently public.

quickstart.sh
# Load a human-POV clip and its 3D skeleton
# 1. Resolve a sample's urls (video, pose_data, keypoints_3d, preview_gif)
# 2. Download video.mp4 and load body_keypoints_3d.npz with NumPy
# 3. Read pose_data [F, 33, 3] + connections to render the skeleton

Request access.

Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.