HUMAN DEMONSTRATION DATA

Human POV & Motion Capture

Name: Human POV & Motion Capture
Creator: Gerra

First-person human video paired with full-body 3D motion capture - the human-demonstration layer for embodied pretraining.

3D SKELETAL · H.264 · BVH · FBX · JSONL · Continuous collection, batch delivery

Body landmarks: 33
Hand landmarks: 21/hand
Coordinates: world-meter skeleton
Pose estimator: MediaPipe


01 Sample

See the data

Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.

Third-person body track, ball toss (33 landmarks)

Representative of the real pose file. pose_data holds 3 coordinates per landmark in world meters.

pose_data.json (representative)

{
  "metadata": { "num_frames": 39, "sample_rate": 5, "num_landmarks": 33, "coords": "world", "units": "meters" },
  "connections": [[0,1],[1,2],[2,3],[3,7],[0,4]],
  "pose_data": [ [ [-0.0349,-0.6541,-0.2357], [-0.0195,-0.6900,-0.2276] ] ]
}
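
A minimal sketch of reading a record of this shape into NumPy and measuring bone lengths along the connection list. The function names `load_pose` and `bone_lengths` are hypothetical, and the three-landmark record below is synthetic, made up only to keep the example small; a real file would carry 33 landmarks per frame.

```python
import json
import numpy as np

def load_pose(text):
    """Parse a pose_data.json string into (pose [F, L, 3], edges [E, 2])."""
    record = json.loads(text)
    pose = np.asarray(record["pose_data"], dtype=np.float64)
    edges = np.asarray(record["connections"], dtype=np.int64).reshape(-1, 2)
    return pose, edges

def bone_lengths(pose, edges):
    """Euclidean length in meters of every edge in every frame, shape [F, E]."""
    start = pose[:, edges[:, 0], :]
    end = pose[:, edges[:, 1], :]
    return np.linalg.norm(end - start, axis=-1)

# Synthetic record (not real data): 1 frame, 3 landmarks, 2 edges.
sample = json.dumps({
    "metadata": {"num_frames": 1, "num_landmarks": 3,
                 "coords": "world", "units": "meters"},
    "connections": [[0, 1], [1, 2]],
    "pose_data": [[[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.3, 0.4, 0.0]]],
})
pose, edges = load_pose(sample)
lengths = bone_lengths(pose, edges)
```

Because coordinates are in world meters, a bone length that swings widely between frames is a cheap signal that tracking slipped on one of its endpoints.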

Clip manifest, POV egocentric

Representative. First-person type with a MediaPipe 33-landmark body pose.

manifest.json (representative)

{
  "episode": { "id": "pov_20251112_233840", "type": "pov", "prompt": "Configuring a home router", "duration_sec": 8.0 },
  "video": { "raw": "raw/video.mp4", "codec": "h264", "fps": 24, "resolution": [1280, 720], "frame_count": 192 },
  "pose_3d": { "method": "mediapipe", "keypoints_uri": "raw/body_keypoints_3d.npz",
               "num_landmarks": 33, "num_frames": 192, "coords": "world", "units": "meters" }
}
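
A hedged sketch of a cross-field sanity check on a manifest of this shape. It assumes that frame_count should equal fps times duration_sec (within one frame) and that the pose track has one entry per video frame, as in the representative record above; neither rule is stated as a guarantee, and `check_manifest` is a hypothetical helper.

```python
import json

def check_manifest(manifest, tolerance_frames=1):
    """Return a list of inconsistencies found in the manifest (empty if none)."""
    problems = []
    video = manifest["video"]
    pose = manifest["pose_3d"]
    expected = video["fps"] * manifest["episode"]["duration_sec"]
    if abs(video["frame_count"] - expected) > tolerance_frames:
        problems.append(
            f"frame_count {video['frame_count']} != fps * duration ({expected:g})"
        )
    # Assumption: pose is sampled once per video frame.
    if pose["num_frames"] != video["frame_count"]:
        problems.append("pose_3d.num_frames does not match video.frame_count")
    return problems

manifest = json.loads("""
{
  "episode": { "id": "pov_20251112_233840", "type": "pov",
               "prompt": "Configuring a home router", "duration_sec": 8.0 },
  "video": { "raw": "raw/video.mp4", "codec": "h264", "fps": 24,
             "resolution": [1280, 720], "frame_count": 192 },
  "pose_3d": { "method": "mediapipe", "keypoints_uri": "raw/body_keypoints_3d.npz",
               "num_landmarks": 33, "num_frames": 192,
               "coords": "world", "units": "meters" }
}
""")
problems = check_manifest(manifest)
```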

First-person hand keypoints (21 landmarks per hand)

Representative of the real packed-array shape; 21 landmarks per hand x 3 coordinates, with per-frame visibility.

hand_keypoints_3d.npz (representative)

left_hand:        float64 [1892, 21, 3]   # XYZ world meters
right_hand:       float64 [1892, 21, 3]
left_visibility:  float64 [1892]          # per-frame confidence
right_visibility: float64 [1892]
landmark_names:   [21]   (WRIST, THUMB_TIP, INDEX_TIP, ...)
connections:      int [23, 2]             # hand skeleton edges
fps:              int
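
One way to consume an archive with these array names is to blank out frames whose visibility is low before using the keypoints. This is a sketch only: the 0.5 threshold and the helper name `masked_hand` are assumptions, and the archive is built in memory from synthetic zeros rather than read from a delivered file.

```python
import io
import numpy as np

def masked_hand(archive, side="left", min_visibility=0.5):
    """Return [F, 21, 3] keypoints with low-confidence frames set to NaN."""
    points = np.array(archive[f"{side}_hand"], dtype=np.float64)  # copy
    visibility = np.asarray(archive[f"{side}_visibility"])
    points[visibility < min_visibility] = np.nan  # mask whole frames
    return points

# Synthetic stand-in for hand_keypoints_3d.npz: 4 frames, left hand only.
buffer = io.BytesIO()
np.savez(
    buffer,
    left_hand=np.zeros((4, 21, 3)),
    left_visibility=np.array([0.9, 0.2, 0.8, 0.1]),
)
buffer.seek(0)

with np.load(buffer) as archive:
    left = masked_hand(archive, "left")
```

Using NaN rather than dropping frames keeps the array aligned with the video timeline, so frame indices still map directly to video frames.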

02 Schema

Record shape

Every field, its type, whether it can be null, and a representative value.

Field	Type	Constraint	Description
episode.id	string	required	Clip identifier. e.g. syn_third_20251119_051421
episode.type	string	required	Egocentric (POV) vs external (third-person). e.g. pov
episode.prompt	string	nullable	Task or activity label. e.g. Configuring a home router
episode.duration_sec	float · s	required	Clip duration. e.g. 8.0
video.codec	string	required	Video codec. e.g. h264
video.fps	int · fps	required	Frame rate. e.g. 24
video.resolution	[w,h] · px	required	Frame size. e.g. [1280, 720]
video.frame_count	int · frames	required	Total frames. e.g. 192
pose_3d.method	string	required	Pose estimator. e.g. mediapipe
pose_3d.num_landmarks	int	required	Landmark count for the track (33 for body, 21 per hand). e.g. 33
pose_3d.coords	string	required	Coordinate frame. e.g. world
pose_3d.units	string	required	Spatial units. e.g. meters
pose_data	float[F][L][3] · m	required	Per-frame, per-landmark XYZ world coordinates. e.g. [[[-0.0349,-0.6541,-0.2357], ...]]
connections	int[][2]	required	Skeleton edge list (landmark index pairs). e.g. [[0,1],[1,2],[2,3]]
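
The required and nullable constraints for the manifest fields in the table can be checked with a few lines of Python. This sketch assumes that "nullable" means the key must be present but may be null, which is an interpretation rather than something the table states; `get_path` and `missing_fields` are hypothetical helpers. pose_data and connections are left out because they appear in the pose file rather than in the representative manifest.

```python
REQUIRED = [
    "episode.id", "episode.type", "episode.duration_sec",
    "video.codec", "video.fps", "video.resolution", "video.frame_count",
    "pose_3d.method", "pose_3d.num_landmarks", "pose_3d.coords", "pose_3d.units",
]
NULLABLE = ["episode.prompt"]

def get_path(record, dotted):
    """Follow a dotted path through nested dicts; return (found, value)."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return False, None
        node = node[key]
    return True, node

def missing_fields(record):
    """List dotted paths that violate the required/nullable constraints."""
    missing = []
    for path in REQUIRED:
        found, value = get_path(record, path)
        if not found or value is None:
            missing.append(path)
    for path in NULLABLE:
        found, _ = get_path(record, path)
        if not found:  # key must exist, but its value may be None
            missing.append(path)
    return missing

record = {
    "episode": {"id": "pov_20251112_233840", "type": "pov",
                "prompt": None, "duration_sec": 8.0},
    "video": {"codec": "h264", "fps": 24,
              "resolution": [1280, 720], "frame_count": 192},
    "pose_3d": {"method": "mediapipe", "num_landmarks": 33,
                "coords": "world", "units": "meters"},
}
violations = missing_fields(record)
```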

03 What's included

First-Person POV

Egocentric human video of real tasks - the visual prior for embodied perception and affordance learning.

Full-Body Motion Capture

Rokoko-suit 3D skeletal tracking with joint-level body landmarks, time-synced to video.

Multi-View Sync

First-person and external viewpoints aligned to the same timeline for cross-view learning.

04 Methodology

How it is built

01
Egocentric and external capture
Human demonstrations are captured as first-person (POV) and external (third-person) video of real tasks in H.264.
02
Pose estimation
MediaPipe extracts 3D skeletal keypoints: 33 body landmarks for the full body and 21 landmarks per hand for fine manipulation, in world coordinates in meters.
03
Skeleton structuring
Keypoints are stored as per-frame XYZ arrays with an explicit connection edge list defining the skeleton graph, plus per-landmark visibility.
04
AI enrichment
A vision-language model auto-generates the task name, description, objects present, and actions, with timestamped narration for language-conditioned policies.
05
Preview and packaging
Each clip is rendered to a preview GIF with tracking overlay and an annotated pose frame, bundled with the raw video, pose JSON, packed keypoints, and a manifest.
06
Public sample surfacing
Real human-POV clips are kept as their own line, served from storage object URLs, separate from any synthetic or generated POV.

05 Evals

How we validate

What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.

Pose tracking confidence

Measures

Whether extracted skeletal keypoints are reliable per frame.

Method

Per-landmark visibility scores are stored alongside keypoints and tracked as a QA metric.

Result

Methodology-stage. Per-frame visibility is recorded; no published aggregate accuracy.
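
No aggregate is published, but a consumer could summarize the recorded visibility scores along these lines. The metric names, the 0.5 threshold, and the function `visibility_report` are illustrative assumptions, not the QA metric used in production, and the input scores are made up.

```python
import numpy as np

def visibility_report(visibility, threshold=0.5):
    """Summarize a 1-D array of per-frame visibility scores."""
    visibility = np.asarray(visibility, dtype=np.float64)
    low = visibility < threshold
    # Longest consecutive run of low-confidence frames.
    longest = run = 0
    for flag in low:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return {
        "mean_visibility": float(visibility.mean()),
        "fraction_low": float(low.mean()),
        "longest_low_run": int(longest),
    }

# Synthetic scores for five frames.
report = visibility_report([0.9, 0.2, 0.1, 0.8, 0.3])
```

The longest low-confidence run matters separately from the overall fraction: a short dropout can be interpolated, while a long one usually means the clip segment should be excluded.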

Skeleton structural validity

Measures

Whether keypoint arrays conform to the expected landmark count and connection graph.

Method

Validate the frame x landmark x 3 array shape against the declared landmark count (33 body, 21 hand) and the connection edge list.

Result

Methodology-stage. Shape and consistency check.
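
A shape and consistency check of this kind might look like the sketch below. `validate_skeleton` is a hypothetical function, and the exact rules applied in the production check are not published; this version only tests the array shape and that every edge index refers to an existing landmark.

```python
import numpy as np

def validate_skeleton(pose, connections, num_landmarks):
    """Return a list of structural errors (empty if the arrays conform)."""
    errors = []
    pose = np.asarray(pose)
    if pose.ndim != 3 or pose.shape[1:] != (num_landmarks, 3):
        errors.append(
            f"expected shape [F, {num_landmarks}, 3], got {list(pose.shape)}"
        )
    edges = np.asarray(connections)
    if edges.ndim != 2 or edges.shape[1] != 2:
        errors.append("connections must be a list of index pairs")
    elif edges.min() < 0 or edges.max() >= num_landmarks:
        errors.append("connection index out of range")
    return errors

# Synthetic inputs: 5 frames of a 33-landmark body track.
ok = validate_skeleton(np.zeros((5, 33, 3)), [[0, 1], [1, 2]], 33)
wrong_count = validate_skeleton(np.zeros((5, 21, 3)), [[0, 1]], 33)
bad_edge = validate_skeleton(np.zeros((5, 33, 3)), [[0, 40]], 33)
```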

AI annotation QA

Measures

Whether the auto-generated task, object, and narration labels are correct.

Method

Generated metadata is reviewed in admin QA before customer visibility.

Result

Methodology-stage. Human review gate; no scored metric.

06 Graders

Ground truth

What correct means for this data, and how it is established.

Ground truth

The human's actual demonstrated motion is the ground truth, and the MediaPipe 3D skeleton plus the 21-landmark-per-hand track is its structured representation. Task and object labels are graded against human-reviewed annotation.

How it is established

Per-frame visibility scores flag low-confidence keypoints, admin human QA reviews enrichment outputs, and structural validity is checked against the declared landmark and connection schema.

Agreement

No inter-rater or keypoint-accuracy agreement statistic is published.

07 Application

Foundation Model Pretraining

Large-scale egocentric video and motion for visual and spatial pretraining of embodied models.

Human-to-Robot Transfer

3D human motion mapped toward robot morphologies for retargeting and demonstration transfer.

Human-Robot Interaction

Multi-perspective human behavior with pose annotations for interaction and intent modeling.

08 Environment & integration

How you load it

Delivery

S3, REST API, CDN, public preview URLs

Formats

H.264 MP4, JSON pose, NPZ packed keypoints, GIF / PNG previews, JSON manifest

Auth

The public sample tier is open via preview URLs; the full dataset is behind org-scoped API keys with tenant isolation.

Cadence

Continuous collection with batch delivery; 26 accepted human-POV previews are currently public.

quickstart.sh

# Load a human-POV clip and its 3D skeleton
# 1. Resolve a sample's urls (video, pose_data, keypoints_3d, preview_gif)
# 2. Download video.mp4 and load body_keypoints_3d.npz with NumPy
# 3. Read pose_data [F, 33, 3] + connections to render the skeleton
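
Steps 2 and 3 above can be sketched in Python once a clip has been downloaded to a local directory. This is not an official client: `load_clip` is a hypothetical helper, the directory layout follows the representative manifest, and the array names inside body_keypoints_3d.npz are not documented here, so the function returns every array it finds. The demo writes synthetic files (with a made-up array name `keypoints`) to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def load_clip(clip_dir):
    """Read manifest.json and the keypoint archive it points to."""
    clip_dir = Path(clip_dir)
    manifest = json.loads((clip_dir / "manifest.json").read_text())
    keypoints_path = clip_dir / manifest["pose_3d"]["keypoints_uri"]
    with np.load(keypoints_path) as archive:
        arrays = {name: archive[name] for name in archive.files}
    return manifest, arrays

# Demo with synthetic files standing in for a downloaded clip.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    # "keypoints" is a hypothetical array name used only for this demo.
    np.savez(root / "raw" / "body_keypoints_3d.npz",
             keypoints=np.zeros((192, 33, 3)))
    (root / "manifest.json").write_text(json.dumps({
        "pose_3d": {"method": "mediapipe",
                    "keypoints_uri": "raw/body_keypoints_3d.npz",
                    "num_landmarks": 33, "num_frames": 192},
    }))
    manifest, arrays = load_clip(root)
```

Inspecting `arrays.keys()` on a real archive is the quickest way to discover the actual array names before writing downstream code.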

09 Related research

Sim2Real Gap in Bipedal Humanoids

Request access.

Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.

Talk to us: team@gerra.com
