RESEARCH PAPER

Empirical Characterization of the Simulation-to-Reality Gap in Full-Size Bipedal Humanoid Robots

A Multi-Dimensional Analysis of Kinematic, Dynamic, and Control Discrepancies

Status: Published

Date: January 2025

Authors: gerra

Platform: Booster T1

Contents: 1. Abstract, 2. Introduction, 3. Related Work, 4. Methodology, 5. Mathematical Framework, 6. Results, 7. Spectral Analysis, 8. Discussion, 9. Conclusion, 10. Appendix

Abstract

The simulation-to-reality (sim2real) gap constitutes a fundamental barrier to deploying learned policies on physical robots. While extensive prior work has characterized this gap for manipulation and quadruped locomotion, systematic empirical analysis for full-size bipedal humanoids remains conspicuously absent from the literature. We present a rigorous quantitative study using the Booster T1 humanoid robot (23 degrees of freedom, ~30 kg), comparing over 61 minutes of real-world operational data—comprising 284,794 joint state samples across 123 episodes—with matched MuJoCo simulations via position-controlled trajectory replay.

Our analysis yields several findings: (i) an aggregate position tracking error of 5.56° MAE shows that a properly configured simulation achieves moderate fidelity for this humanoid; (ii) velocity tracking agrees closely (0.41 rad/s MAE) under position-controlled replay; (iii) knee joints are the largest source of error (12.0-12.2° MAE), consistent with expectations for load-bearing joints during locomotion; (iv) correlation analysis reveals moderate agreement for most upper-body joints (r ≈ 0.35-0.52; waist and head yaw are near zero) and weak agreement for leg joints (r ≈ 0.1-0.36), suggesting that contact dynamics are the primary modeling challenge. The overall gap score is 0.36 on a normalized [0, 1] scale, and the results provide actionable guidance for improving humanoid simulation fidelity and designing robust sim2real transfer methods.

Metric	Value	Description
Overall gap score	0.36	Good simulation fidelity, indicating moderate agreement between sim and real
Position MAE	5.56°	Average joint position error across all 23 degrees of freedom
Knee error	12.1° MAE	Highest per-joint error, driven by load-bearing during locomotion

Composite Gap Score Decomposition

Component	Score
Kinematic	0.30
Dynamic	0.07
Control	0.74
Overall	0.36

Introduction

2.1 Motivation

Simulation has become indispensable for robot learning, enabling safe exploration, massively parallel data collection, and rapid algorithmic iteration. The dominant paradigm in learning-based control, training policies in simulation and deploying them on hardware, has achieved remarkable success for manipulation, quadrupeds, and drones. However, the simulation-to-reality gap remains the critical bottleneck determining whether policies transfer successfully.

Bipedal humanoid robots present unique challenges that amplify this gap:

High-dimensional actuation: With 20+ degrees of freedom, modeling errors compound across the kinematic chain
Underactuation and balance: Unlike quadrupeds, humanoids cannot rely on static stability; accurate center-of-mass dynamics are essential
Intermittent contact: The swing/stance cycle creates discontinuous dynamics that simulators struggle to capture
Ground reaction forces: Contact forces during locomotion reach 2-3× body weight, stressing both actuators and physics models
Closed kinematic chains: During double-support phases, kinematic constraints couple all leg joints

2.2 Contributions

This work provides the first comprehensive empirical characterization of the sim2real gap for a full-size bipedal humanoid robot:

Large-Scale Dataset: Over 61 minutes of real-world operational data (284,794 joint samples at ~88 Hz across 123 episodes)
Position-Controlled Replay Protocol: Rigorous trajectory replay in MuJoCo using PD control, isolating physics modeling errors from control policy differences
Multi-Dimensional Gap Decomposition: Characterization across kinematic (position), dynamic (velocity), and control (correlation, spectral) dimensions
Joint-Specific Analysis: Detailed per-joint breakdown revealing knee and hip pitch joints as primary error contributors
Actionable Insights: Empirically grounded recommendations for domain randomization ranges and system identification priorities

2.3 Scope and Limitations

In Scope

Position-controlled replay comparison
Statistical characterization of tracking errors
Frequency-domain analysis
Per-joint and per-body-region decomposition

Out of Scope

Closed-loop policy transfer experiments
Contact force ground truth
Generalization to other tasks
Cross-simulator comparisons

Related Work

3.1 Sim2Real Gap Characterization

The sim2real gap has been studied primarily in the context of specific transfer techniques rather than as an independent characterization problem. Tobin et al. (2017) introduced domain randomization for manipulation, demonstrating transfer without explicit gap measurement. Tan et al. (2018) showed that quadruped dynamics in simulation can deviate by 10-20% from reality even with careful calibration, though they focused on transfer outcomes rather than systematic gap analysis.

For humanoids specifically, prior work has examined subsystems in isolation. Englsberger et al. (2014) analyzed the gap in walking pattern generation for the TORO robot, finding substantial drift in center-of-mass tracking. Radford et al. (2015) characterized discrepancies for NASA's Valkyrie during valve-turning tasks. However, comprehensive whole-body analysis across all joints during dynamic tasks remains absent from the literature.

3.2 Physics Simulation Engines

Modern physics simulators—MuJoCo (Todorov et al., 2012), IsaacGym (Makoviychuk et al., 2021), Drake (Tedrake, 2019), and PyBullet (Coumans & Bai, 2016)—have achieved substantial improvements in accuracy and speed. The MuJoCo Menagerie project provides validated models for diverse robots, including the Booster T1 used in this study.

Despite these advances, fundamental sources of modeling error persist:

Contact dynamics: All simulators approximate contact via penalty methods or constraint formulations that cannot perfectly match real friction, deformation, and slip
Actuator models: Servo dynamics including torque-speed curves, thermal effects, and bandwidth limitations are typically simplified
Inertial parameters: Even CAD-derived parameters contain tolerances and neglect internal wiring, fluids, and flexible elements
Structural compliance: Rigid-body assumptions ignore link flexibility and joint compliance

3.3 Transfer Methods

Three dominant paradigms address the sim2real gap:

Domain Randomization

Training with randomized dynamics parameters to achieve robustness. Effective DR requires knowledge of reasonable randomization ranges—which our gap characterization directly informs.

System Identification

Calibrating simulation parameters to match real behavior. Our per-joint error analysis identifies which parameters to prioritize for calibration efforts.

Residual Learning

Learning corrections to simulation dynamics. The consistent biases we observe (e.g., knee pitch showing ~12° error) suggest residual learning is well-suited to this domain.

Experimental Setup & Methodology

4.1 Robot Platform

The Booster T1 is a full-size bipedal humanoid manufactured by Booster Robotics. The kinematic structure comprises head (2 DOF), arms (8 DOF total), torso (1 DOF), and legs (12 DOF total).

Property	Value
Height	1.3 m
Mass	30 kg
Degrees of Freedom	23 actuated joints
Actuators	Proprietary servo motors
Control frequency	~88 Hz (variable)

4.2 Data Collection

We collected data from a Supabase database of Booster T1 operational episodes covering various manipulation and locomotion tasks. The dataset spans diverse real-world scenarios, including box manipulation, walking, reaching, and whole-body coordination.

Parameter	Value
Number of episodes	123
Total duration	3,687 seconds (~61 minutes)
Total joint samples	284,794
Sample rate	88.3 Hz (mean)
Task types	Various manipulation and locomotion
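The table reports 88.3 Hz as a mean sample rate, while its totals imply a pooled rate of 284,794 / 3,687 s ≈ 77.2 Hz. Both can hold if 88.3 Hz is the average of per-episode rates, because a mean of ratios generally differs from a ratio of sums (short, densely sampled episodes pull the mean up). The sketch below shows the two computations side by side; the `Episode` record is hypothetical and is not the schema used by `data_loader.py`.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """Hypothetical per-episode summary (not the authors' data_loader.py schema)."""
    duration_s: float
    n_samples: int


def pooled_rate_hz(episodes):
    """Ratio of sums: all samples divided by all recorded time."""
    return sum(e.n_samples for e in episodes) / sum(e.duration_s for e in episodes)


def mean_episode_rate_hz(episodes):
    """Mean of ratios: each episode's own sample rate, then averaged."""
    return mean(e.n_samples / e.duration_s for e in episodes)


# Pooled rate implied by the totals in the table above.
print(round(284_794 / 3_687, 1))  # 77.2

# Toy example: one long 100 Hz episode and one short 200 Hz episode.
toy = [Episode(10.0, 1000), Episode(2.0, 400)]
print(round(pooled_rate_hz(toy), 2), mean_episode_rate_hz(toy))  # 116.67 150.0
```

Which of the two a "mean sample rate" refers to should be stated explicitly, since they diverge whenever episode lengths and logging rates vary.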

4.3 Simulation Environment

We use MuJoCo 3.x with the official Booster T1 model from MuJoCo Menagerie. Position-controlled actuators with PD gains (Kp=75, Kv=5) approximate servo behavior.

Parameter	Value	Rationale
Physics timestep	1 ms (1000 Hz)	Standard for stable contact simulation
Control frequency	100 Hz	Matches typical servo command rate
Integrator	implicitfast (implicit-in-velocity Euler)	Balance of speed and stability
Contact model	Convex mesh collision	High fidelity for foot contacts

4.4 Trajectory Replay Protocol

To isolate physics modeling errors from control policy differences, we perform position-controlled trajectory replay:

Initialization: Set simulation joint angles to match the first recorded configuration
Replay loop: At each control timestep (100 Hz), set position targets from recorded trajectory; simulation uses PD controller to track positions
Physics step: Advance simulation for one control period (10 physics steps at 1000 Hz), producing meaningful velocity dynamics
Recording: Capture resulting simulated joint states for comparison

This protocol is designed so that discrepancies between real and simulated trajectories arise primarily from physics and actuator modeling (including the PD approximation of the servos) rather than from control policy differences. Unlike purely kinematic replay, it also yields simulated velocities that can be meaningfully compared with the real ones.
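The four protocol steps can be sketched as a minimal replay loop. This is a toy stand-in, not the authors' `booster_t1_sim.py`: a single PD-driven rotational inertia (hypothetical value) replaces the MuJoCo model, and linear resampling of the ~88 Hz recording onto the 100 Hz control grid is an assumption, since the resampling method is not stated. Only the gains (Kp=75, Kv=5), the 1 kHz physics step, and the 10 substeps per control period come from the setup above.

```python
import math

KP, KV = 75.0, 5.0                    # PD gains from Section 4.3
PHYS_DT = 0.001                       # 1 kHz physics step
CTRL_DT = 0.01                        # 100 Hz control step
SUBSTEPS = round(CTRL_DT / PHYS_DT)   # 10 physics steps per control period


def resample(times, values, new_dt):
    """Linearly interpolate an irregularly timed trace onto a uniform grid (assumed method)."""
    n = int((times[-1] - times[0]) / new_dt + 1e-9) + 1
    out, i = [], 0
    for k in range(n):
        t = times[0] + k * new_dt
        while i < len(times) - 2 and times[i + 1] < t:
            i += 1
        t0, t1 = times[i], times[i + 1]
        out.append(values[i] + (t - t0) / (t1 - t0) * (values[i + 1] - values[i]))
    return out


def replay(real_times, real_q, inertia=0.1):
    """Replay one recorded joint trajectory through a toy PD-driven joint.

    A single rotational inertia (kg m^2, hypothetical) integrated with
    semi-implicit Euler stands in for the MuJoCo model.
    """
    targets = resample(real_times, real_q, CTRL_DT)
    q, qd = targets[0], 0.0               # initialization: first recorded configuration
    sim_q, sim_qd = [], []
    for target in targets:                # replay loop: one position target per 10 ms
        sim_q.append(q)                   # recording: state at this control tick
        sim_qd.append(qd)
        for _ in range(SUBSTEPS):         # physics step: 10 x 1 ms
            tau = KP * (target - q) - KV * qd   # position-actuator law kp*(ctrl - q) - kv*qd
            qd += tau / inertia * PHYS_DT
            q += qd * PHYS_DT
    return targets, sim_q, sim_qd


# Usage: a 0.5 Hz knee-like swing logged at ~88.3 Hz for 2 s.
times = [i / 88.3 for i in range(177)]
real = [0.3 * math.sin(2 * math.pi * 0.5 * t) for t in times]
targets, sim_q, sim_qd = replay(times, real)
mae = sum(abs(s - r) for s, r in zip(sim_q, targets)) / len(targets)
print(f"toy replay MAE: {math.degrees(mae):.2f} deg")
```

Even with a perfect plant model, PD tracking introduces a lag of roughly Kv·v/Kp on moving joints, so part of any measured replay error reflects the controller approximation rather than physics alone.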

Mathematical Framework

5.1 Problem Formulation

Let the robot state at time t be represented by the joint configuration vector q(t) ∈ ℝⁿ where n = 23 for the Booster T1. We denote real robot trajectories as q^r(t) and simulated trajectories as q^s(t).

Definition: Sim2Real Gap

The sim2real gap Δ(t) is defined as the instantaneous discrepancy between simulated and real states:

$$\Delta(t) = q^s(t) - q^r(t) \in \mathbb{R}^n \tag{1}$$

The gap manifests through multiple channels, which we decompose into complementary components for analysis.

5.2 Position Tracking Metrics

For joint j with trajectories {q_j^r(t)} and {q_j^s(t)} over T timesteps:

$$\mathrm{MAE}_j = \frac{1}{T}\sum_{t=1}^{T}\left|q_j^s(t) - q_j^r(t)\right| \tag{2}$$

$$\mathrm{RMSE}_j = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(q_j^s(t) - q_j^r(t)\right)^2} \tag{3}$$

$$\mathrm{MaxErr}_j = \max_t \left|q_j^s(t) - q_j^r(t)\right| \tag{4}$$
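Eqs. (2)-(4) for a single, time-aligned joint, as a minimal pure-Python sketch (the function name is illustrative). Angles are in radians and converted with `math.degrees` for reporting, which is how 0.097 rad and 0.150 rad map to 5.56° and 8.59°. RMSE is always at least MAE; for zero-mean Gaussian errors the ratio is about 1.25, so the aggregate ratio 0.150 / 0.097 ≈ 1.55 is consistent with the heavy-tailed error distribution described for Figure 3.

```python
import math


def tracking_errors(sim, real):
    """MAE, RMSE and maximum absolute error (Eqs. 2-4) for one time-aligned joint."""
    if len(sim) != len(real) or not sim:
        raise ValueError("trajectories must be non-empty and the same length")
    errs = [s - r for s, r in zip(sim, real)]
    T = len(errs)
    mae = sum(abs(e) for e in errs) / T
    rmse = math.sqrt(sum(e * e for e in errs) / T)
    max_err = max(abs(e) for e in errs)
    return mae, rmse, max_err


# Errors of 0.1, 0.0 and 0.3 rad: RMSE exceeds MAE because it weights the 0.3 rad outlier more.
mae, rmse, max_err = tracking_errors([0.1, 0.2, 0.4], [0.0, 0.2, 0.1])
print(f"{math.degrees(mae):.2f} deg, {math.degrees(rmse):.2f} deg, {math.degrees(max_err):.2f} deg")
```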

5.3 Correlation Analysis

Pearson correlation captures trajectory shape agreement independent of systematic offset:

$$\rho_j = \frac{\operatorname{Cov}(q_j^s, q_j^r)}{\sigma_{q_j^s}\,\sigma_{q_j^r}} \tag{5}$$

Where ρ = 1 indicates perfect positive correlation (identical trajectory shapes), ρ = 0 indicates no linear relationship, and ρ = -1 indicates anti-phase behavior.

Proposition 1: Correlation-Error Decomposition

Low correlation (ρ ≈ 0) combined with high MAE indicates fundamental model structure mismatch, not merely parameter errors. If the model structure were correct but parameters wrong, we would observe high correlation with systematic offset (bias).
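Proposition 1 can be illustrated with synthetic signals (not study data) and a plain implementation of Eq. (5): a constant offset, the signature of a parameter error, raises MAE while leaving ρ at 1, whereas a quarter-period phase lag, used here as a stand-in for a structural mismatch, drives ρ to about zero while MAE stays large.

```python
import math


def pearson(x, y):
    """Pearson correlation (Eq. 5); the 1/n factors cancel between numerator and denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either trajectory is constant
    return cov / (sx * sy)


def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Synthetic signals: one full period sampled at 100 points.
real = [math.sin(2 * math.pi * k / 100) for k in range(100)]
biased = [r + 0.2 for r in real]                                   # pure offset
shifted = [math.cos(2 * math.pi * k / 100) for k in range(100)]    # quarter-period lag

print(round(pearson(biased, real), 6), round(mae(biased, real), 6))  # 1.0 0.2
print(round(pearson(shifted, real), 3), round(mae(shifted, real), 3))
```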

5.4 Composite Gap Scores

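We define normalized gap scores G ∈ [0, 1], where 0 = perfect match and 1 = maximum expected divergence (G_ctrl, defined as one minus the mean correlation, can exceed 1 only if the mean correlation is negative):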

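$$G_{\mathrm{kin}} = \min\left(1, \frac{\mathrm{RMSE}_{\mathrm{pos}}}{\varepsilon_{\max}^{\mathrm{pos}}}\right), \qquad \varepsilon_{\max}^{\mathrm{pos}} = 0.5\ \mathrm{rad} \tag{6}$$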

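$$G_{\mathrm{dyn}} = \min\left(1, \frac{\mathrm{RMSE}_{\mathrm{vel}}}{\varepsilon_{\max}^{\mathrm{vel}}}\right), \qquad \varepsilon_{\max}^{\mathrm{vel}} = 2.0\ \mathrm{rad/s} \tag{7}$$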

$$G_{\mathrm{ctrl}} = 1 - \bar{\rho}, \qquad \bar{\rho} = \frac{1}{n}\sum_{j=1}^{n}\rho_j \tag{8}$$

$$G_{\mathrm{overall}} = 0.4\,G_{\mathrm{kin}} + 0.3\,G_{\mathrm{dyn}} + 0.3\,G_{\mathrm{ctrl}} \tag{9}$$

The weighting (0.4, 0.3, 0.3) prioritizes kinematic accuracy while giving equal importance to dynamics and control tracking. These weights can be adjusted based on application requirements.
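Eqs. (6)-(9) translate directly into a short function; the names below are illustrative. Plugging in the aggregate metrics of Section 6.1 reproduces G_kin = 0.150 / 0.5 = 0.300 and G_ctrl = 1 - 0.257 = 0.743, and weighting the reported components (0.300, 0.065, 0.743) gives the reported overall score of 0.362. With ε_max^vel = 2.0 rad/s, however, the velocity RMSE of 0.649 rad/s gives G_dyn ≈ 0.32; the tabulated 0.065 matches a normalizer of about 10 rad/s (0.649 / 10 ≈ 0.065), so `eps_vel` is kept as a parameter in this sketch.

```python
def overall_gap(g_kin, g_dyn, g_ctrl, weights=(0.4, 0.3, 0.3)):
    """Weighted composite score, Eq. (9)."""
    w_kin, w_dyn, w_ctrl = weights
    return w_kin * g_kin + w_dyn * g_dyn + w_ctrl * g_ctrl


def gap_scores(rmse_pos, rmse_vel, mean_rho, eps_pos=0.5, eps_vel=2.0):
    """Component scores from Eqs. (6)-(8) plus the overall score."""
    g_kin = min(1.0, rmse_pos / eps_pos)
    g_dyn = min(1.0, rmse_vel / eps_vel)
    g_ctrl = 1.0 - mean_rho          # exceeds 1 only if the mean correlation is negative
    return {"kin": g_kin, "dyn": g_dyn, "ctrl": g_ctrl,
            "overall": overall_gap(g_kin, g_dyn, g_ctrl)}


# Aggregate metrics from Section 6.1 (rad, rad/s, dimensionless).
s = gap_scores(rmse_pos=0.150, rmse_vel=0.649, mean_rho=0.257)
print({k: round(v, 4) for k, v in s.items()})
# Reported components weighted per Eq. (9):
print(round(overall_gap(0.300, 0.065, 0.743), 3))  # 0.362
```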

5.5 Spectral Analysis

To characterize frequency-dependent divergence, we compute power spectral densities using Welch's method with 256-sample Hanning windows (50% overlap):

$$P_x(f) = \frac{1}{K}\sum_{k=1}^{K}\left|X_k(f)\right|^2 \tag{10}$$

The spectral divergence is quantified using symmetrized Kullback-Leibler divergence between normalized PSDs:

$$D_{\mathrm{KL}}(P_{\mathrm{real}} \,\|\, P_{\mathrm{sim}}) = \sum_f P_{\mathrm{real}}(f)\,\log\frac{P_{\mathrm{real}}(f)}{P_{\mathrm{sim}}(f)} \tag{11}$$

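$$D_{\mathrm{spectral}} = \tfrac{1}{2}\left[D_{\mathrm{KL}}(P_{\mathrm{real}} \,\|\, P_{\mathrm{sim}}) + D_{\mathrm{KL}}(P_{\mathrm{sim}} \,\|\, P_{\mathrm{real}})\right] \tag{12}$$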
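A dependency-free sketch of Eqs. (10)-(12). It assumes a periodic Hann window, removal of each segment's mean before windowing (SciPy's default detrending for Welch), and a small floor on PSD bins to keep the logarithms finite; none of these details are specified above. In practice `scipy.signal.welch` with `window="hann"`, `nperseg=256`, and `noverlap=128` would replace `welch_psd`, and the signals below are synthetic.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def welch_psd(x, nperseg=256, overlap=0.5, detrend=True):
    """Eq. (10): average of windowed periodograms over K overlapping segments.

    Returns one-sided bin powers without physical scaling; a plain O(N^2) DFT
    keeps the sketch dependency-free (use an FFT for real data).
    """
    step = int(nperseg * (1 - overlap))
    w = hann(nperseg)
    n_bins = nperseg // 2 + 1
    psd, K = [0.0] * n_bins, 0
    for start in range(0, len(x) - nperseg + 1, step):
        seg = x[start:start + nperseg]
        m = sum(seg) / nperseg if detrend else 0.0   # remove segment mean (assumed)
        seg = [(v - m) * wi for v, wi in zip(seg, w)]
        for f in range(n_bins):
            X = sum(v * cmath.exp(-2j * math.pi * f * i / nperseg) for i, v in enumerate(seg))
            psd[f] += abs(X) ** 2
        K += 1
    if K == 0:
        raise ValueError("signal is shorter than one segment")
    return [p / K for p in psd]


def _normalise(p, floor=1e-12):
    """Scale a PSD to sum to 1; the floor keeps log() finite in empty bins."""
    p = [max(v, floor) for v in p]
    total = sum(p)
    return [v / total for v in p]


def kl(p, q):
    """Eq. (11) for two normalised spectra."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def spectral_divergence(psd_real, psd_sim):
    """Eq. (12): symmetrised KL divergence between normalised PSDs."""
    p, q = _normalise(psd_real), _normalise(psd_sim)
    return 0.5 * (kl(p, q) + kl(q, p))


# Synthetic example at the logged rate: the "real" signal carries extra 7 Hz content.
fs = 88.3
real = [math.sin(2 * math.pi * 2 * i / fs) + 0.3 * math.sin(2 * math.pi * 7 * i / fs)
        for i in range(1024)]
sim = [math.sin(2 * math.pi * 2 * i / fs) for i in range(1024)]
print(spectral_divergence(welch_psd(real), welch_psd(sim)))
```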

5.6 Error Propagation Model

To first order, joint errors in a serial kinematic chain propagate to end-effector error according to:

$$\Delta x_{\mathrm{ee}} = J(q)\,\Delta q \tag{13}$$

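Here J(q) is the Jacobian matrix. Each column of J scales one joint's angular error by that joint's lever arm, i.e. the distance from its axis to the point of interest. During stance the leg chain is effectively rooted at the foot, so ankle and knee errors act through long lever arms on the hip and center of mass. This helps explain why lower-leg joints (knees, ankles) can contribute disproportionately to task-space error even when their joint-space errors are similar in magnitude to those of other joints.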
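A minimal sketch of Eq. (13) for a hypothetical planar stance leg rooted at the foot, with ankle and knee pitch only and 0.4 m shank and thigh (illustrative lengths, not T1 specifications). The norm of each Jacobian column equals that joint's lever arm to the hip, and a finite-difference check confirms the first-order approximation.

```python
import math


def hip_position(ankle, knee, shank=0.4, thigh=0.4):
    """Planar stance leg rooted at the foot: hip position (x forward, z up)."""
    x = shank * math.sin(ankle) + thigh * math.sin(ankle + knee)
    z = shank * math.cos(ankle) + thigh * math.cos(ankle + knee)
    return x, z


def jacobian(ankle, knee, shank=0.4, thigh=0.4):
    """Analytic 2x2 Jacobian d(x, z) / d(ankle, knee) of hip_position."""
    c1, c12 = math.cos(ankle), math.cos(ankle + knee)
    s1, s12 = math.sin(ankle), math.sin(ankle + knee)
    return [[shank * c1 + thigh * c12, thigh * c12],
            [-shank * s1 - thigh * s12, -thigh * s12]]


def propagate(J, dq):
    """Eq. (13): first-order task-space error J(q) dq."""
    return [sum(J[r][c] * dq[c] for c in range(len(dq))) for r in range(len(J))]


q = (0.1, -0.4)              # illustrative ankle and knee angles (rad)
dq = (0.0, 0.01)             # 0.01 rad knee error only
J = jacobian(*q)
linear = propagate(J, dq)
x0, z0 = hip_position(*q)
x1, z1 = hip_position(q[0] + dq[0], q[1] + dq[1])
print(linear, (x1 - x0, z1 - z0))

# Column norms are the lever arms: ankle-to-hip distance, and the thigh length for the knee.
print(math.hypot(J[0][0], J[1][0]), math.hypot(*hip_position(*q)))
print(math.hypot(J[0][1], J[1][1]))
```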

Proposition 2: Error Amplification in Bipeds

For bipedal locomotion, a knee error Δθ displaces the center of mass by roughly 2·L·Δθ, where L is the shin (shank) length. A 0.3 rad (17°) knee error with a 0.4 m shin therefore produces about 2 × 0.4 m × 0.3 rad ≈ 24 cm of CoM displacement, which is catastrophic for balance.

Results

6.1 Aggregate Gap Metrics

Metric	Value	Units
Position MAE	0.097	rad (5.56°)
Position RMSE	0.150	rad (8.59°)
Velocity MAE	0.408	rad/s
Velocity RMSE	0.649	rad/s
Mean Correlation	0.257	-

Gap Score Breakdown

Component	Score	Interpretation
Kinematic Gap	0.300	Moderate position tracking
Dynamic Gap	0.065	Excellent velocity alignment
Control Gap	0.743	Weak mean trajectory correlation (ρ̄ = 0.257)
Overall Gap	0.362	Good simulation fidelity

The low dynamic gap score (0.065) indicates that velocity tracking is well matched under position-controlled replay, with a velocity MAE of only 0.41 rad/s.

6.2 Trajectory Comparison

Figure 2 shows real vs. simulated trajectories for 8 representative joints spanning all body regions. The visual divergence is immediately apparent for leg joints (bottom row) compared to arm joints (middle row).

Figure 2: Trajectory comparison between real robot (blue) and MuJoCo simulation (orange) for 8 representative joints. Top row: Head and torso joints show reasonable agreement. Middle row: Arm joints track well with moderate offset. Bottom row: Leg joints (hip, knee, ankle) show substantial divergence, with the simulation failing to capture the actual motion patterns.

6.3 Per-Joint Analysis

Joint-specific analysis reveals heterogeneity in simulation accuracy:

Joint	MAE (deg)	RMSE (deg)	Correlation
Right_Knee_Pitch	12.2	13.7	0.36
Left_Knee_Pitch	12.0	13.5	0.36
Right_Elbow_Yaw	8.9	11.9	0.45
Left_Elbow_Yaw	8.7	11.7	0.44
Left_Hip_Pitch	8.6	10.5	0.33
Right_Hip_Pitch	8.4	10.3	0.35
Left_Shoulder_Pitch	7.0	9.3	0.38
Head_pitch	3.8	5.2	0.52
Waist	3.4	4.6	0.01
AAHead_yaw	1.6	2.4	0.03

Key Observation

Knee joints show highest error (~12° MAE), consistent with load-bearing during locomotion. Upper body joints maintain moderate correlation (0.35-0.52) while leg joints show weaker correlation (0.1-0.36), indicating contact dynamics as the primary modeling challenge.

[Chart: Position MAE by joint (degrees) for R_Knee_Pitch, L_Knee_Pitch, R_Elbow_Yaw, L_Hip_Pitch, L_Shoulder, Head_Pitch, Waist and Head_Yaw; values as in the per-joint table above.]

Figure 3: Position MAE distribution across all 23 joints. Left: Bar chart showing MAE in degrees for each joint, color-coded by body region (red = legs, blue = arms, green = head/torso). Right: Histogram of all joint errors, showing the heavy right tail caused by leg joints.

6.4 Body Region Analysis

Region	Mean MAE (deg)	Mean Correlation	# Joints
Legs	6.7	0.15	12
Arms	5.5	0.40	8
Torso	3.4	0.01	1
Head	2.7	0.27	2

Statistical comparison: Legs vs. Arms effect size d = 0.41 (small-medium), p < 0.05. Legs show ~1.2× higher error than arms, with the difference driven primarily by knee joints.
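The text does not state the test or the unit of analysis behind d = 0.41. Assuming the conventional pooled-standard-deviation form of Cohen's d over per-joint (or per-episode) MAE values, it can be computed as below; the inputs shown are illustrative only. By convention, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.

```python
import math
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d for two independent samples using the pooled (n - 1) standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Illustrative inputs only; per-joint or per-episode MAE lists would be passed in practice.
print(round(cohens_d([2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0, 4.0, 5.0]), 4))  # 0.6325
```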

Figure 4: Error decomposition by body region. Left: Box plots showing the position MAE distribution for each body region; legs exhibit both a higher median and greater variance. Right: Radar chart of gap score components showing the dominant contribution of dynamic gap (velocity tracking) across all regions.

Figure 5: Correlation matrix between real and simulated joint trajectories across all 123 episodes. Green indicates positive correlation (simulation tracks reality), white indicates no correlation, and red indicates anti-correlation. The stark contrast between upper-body (moderate correlation) and lower-body (near-zero or negative correlation) joints is clearly visible.

Spectral Analysis

To understand frequency-dependent divergence patterns, we performed power spectral density (PSD) analysis on both real and simulated trajectories. This reveals which frequency components are well-modeled versus poorly captured.

7.1 Frequency Band Decomposition

We decompose the spectrum into three mechanically meaningful bands:

Frequency Band	Range	Real Power	Sim Power	Divergence Pattern
Low frequency	0-2 Hz	Higher	Lower	Simulation under-predicts slow deliberate motions
Mid frequency	2-10 Hz	Comparable	Comparable	Reasonable agreement for control bandwidth
High frequency	>10 Hz	Lower	Lower	Both systems show damping; sim slightly more

Proposition 3: Frequency-Dependent Gap Structure

The sim2real gap is not uniform across frequencies. Low-frequency divergence suggests unmodeled planning-level dynamics or systematic position offsets. High-frequency agreement indicates that both systems exhibit similar damping characteristics, though the simulator may be over-damped.

7.2 Spectral Divergence by Joint Category

Joint Category	Mean KL Divergence	Dom. Freq (Real)	Dom. Freq (Sim)	Interpretation
Head	0.47	2.4 Hz	0.3 Hz	Large frequency mismatch despite low position error
Arms	0.05	1.8 Hz	1.6 Hz	Good spectral agreement
Waist	0.11	1.2 Hz	1.0 Hz	Moderate agreement
Legs	0.09	2.1 Hz	1.8 Hz	Lower KL despite high position error
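Given a PSD such as the one produced by Welch's method, the band powers of Section 7.1 and the dominant frequencies tabulated above reduce to bin bookkeeping. A minimal sketch, assuming a one-sided PSD with bins spaced fs/nperseg apart (about 0.34 Hz for 256-sample windows at 88.3 Hz) and half-open bands; the function names and toy spectrum are illustrative. That spacing also limits how precisely low dominant frequencies such as the 0.3 Hz simulated head peak can be resolved, since such a peak falls roughly one bin above DC.

```python
def band_fractions(psd, fs, nperseg, bands=((0.0, 2.0), (2.0, 10.0), (10.0, float("inf")))):
    """Fraction of total spectral power in each [lo, hi) band, for a one-sided PSD."""
    df = fs / nperseg                     # bin spacing in Hz
    total = sum(psd)
    return [sum(p for k, p in enumerate(psd) if lo <= k * df < hi) / total for lo, hi in bands]


def dominant_frequency(psd, fs, nperseg, skip_dc=True):
    """Frequency of the largest PSD bin, optionally ignoring the DC bin."""
    start = 1 if skip_dc else 0
    k = max(range(start, len(psd)), key=psd.__getitem__)
    return k * fs / nperseg


# Toy one-sided PSD (129 bins for nperseg=256 at 88.3 Hz) with a single peak near 2.4 Hz.
fs, nperseg = 88.3, 256
psd = [1.0] * (nperseg // 2 + 1)
psd[7] = 50.0
print([round(f, 3) for f in band_fractions(psd, fs, nperseg)])
print(round(dominant_frequency(psd, fs, nperseg), 2))   # 2.41
```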

Key Insight

The head joints show high spectral divergence (KL = 0.47) despite low position error (2.7° mean MAE, the lowest of any body region). This apparent paradox shows that position-based metrics alone are insufficient: the real head exhibits higher-frequency micro-movements (vestibular corrections, gaze stabilization) that the simulation smooths out, as reflected in the dominant-frequency mismatch (2.4 Hz real vs. 0.3 Hz simulated).

Figure 6: Power spectral density comparison for 6 leg joints. Blue traces show real robot PSDs; orange traces show MuJoCo simulation. Note the systematic under-prediction of low-frequency power and slight over-damping at high frequencies. The crossover point around 5 Hz suggests that actuator bandwidth may be correctly modeled but friction/damping parameters are off.

7.3 Implications for Model Improvement

The spectral analysis provides specific guidance for simulation improvement:

Low-frequency deficit: Add noise or perturbations to reference trajectories to capture planning-level variability
High-frequency over-damping: Reduce joint damping coefficients by 20-40%
Head frequency mismatch: Model vestibular feedback or add IMU-driven corrections
Contact-induced transients: The simulation misses high-frequency impact vibrations present in real foot strikes

Discussion

8.1 Sources of the Sim2Real Gap

Contact Modeling (Primary Source)

Largest errors occur in joints directly affecting foot-ground interaction (knee ~12°, hip ~8°). MuJoCo's contact model cannot capture shoe deformation, surface friction variability, or slip-stick transitions. Lower correlation for leg joints (r ≈ 0.15) versus arms (r ≈ 0.40) suggests contact dynamics remain the primary modeling challenge.

Actuator Dynamics (Well-Modeled)

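Velocity tracking agrees closely (0.41 rad/s MAE) under position-controlled replay, indicating that actuator dynamics are adequately captured by the PD controller model. The low dynamic gap score (0.065) is consistent with well-matched first-order dynamics, although replay drives the simulated joints along the recorded positions, which by itself tends to keep simulated velocities close to the real ones.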

Structural Compliance

The weak correlation for leg joints (0.15 mean) compared to arms (0.40 mean) may indicate unmodeled structural effects: link flexibility, joint play, and cable/wiring effects not captured in rigid-body simulation.

8.2 Implications for Sim2Real Transfer

Domain Randomization Ranges

Parameter	Recommended Range	Justification
Leg joint positions	±8°	Covers observed error distribution
Arm joint positions	±5°	Upper-body errors are lower
Joint velocities	±1 rad/s	Based on 0.41 rad/s MAE
Ground friction	0.3 - 0.8	Contact is primary error source
Actuator strength	±20%	Dynamics well-matched
Joint damping	1-3×	Velocity tracking is good

Critical insight: Uniform randomization across all joints is suboptimal. Leg joints require a wider randomization range than arms because of contact dynamics.
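A minimal sketch of region-dependent randomization following the table above. How each range is applied is not specified, so this sketch assumes uniform sampling, treats the position and velocity ranges as additive per-joint perturbations, and treats actuator strength and damping as multiplicative scales. The constant names, dictionary layout, and `sample_episode_params` are hypothetical and would need to be wired into a specific simulator.

```python
import math
import random

# Ranges from the table above; how each range is applied is an assumption of this sketch.
LEG_POS_DEG, ARM_POS_DEG = 8.0, 5.0   # uniform +/- bounds on joint position offsets
VEL_RAD_S = 1.0                        # uniform +/- bound on joint velocity perturbations
FRICTION = (0.3, 0.8)                  # ground friction coefficient
STRENGTH_SCALE = (0.8, 1.2)            # actuator strength, +/-20%
DAMPING_SCALE = (1.0, 3.0)             # joint damping multiplier, 1-3x


def sample_episode_params(leg_joints, arm_joints, rng):
    """Draw one hypothetical set of randomized parameters for a training episode."""
    params = {
        "friction": rng.uniform(*FRICTION),
        "strength_scale": rng.uniform(*STRENGTH_SCALE),
        "damping_scale": rng.uniform(*DAMPING_SCALE),
        "pos_offset_rad": {},
        "vel_offset_rad_s": {},
    }
    # Legs get the wider position range, arms the narrower one.
    for joints, bound_deg in ((leg_joints, LEG_POS_DEG), (arm_joints, ARM_POS_DEG)):
        bound = math.radians(bound_deg)
        for name in joints:
            params["pos_offset_rad"][name] = rng.uniform(-bound, bound)
            params["vel_offset_rad_s"][name] = rng.uniform(-VEL_RAD_S, VEL_RAD_S)
    return params


rng = random.Random(0)
p = sample_episode_params(["Left_Knee_Pitch", "Right_Knee_Pitch"], ["Left_Elbow_Yaw"], rng)
print(round(p["friction"], 3), {k: round(math.degrees(v), 2) for k, v in p["pos_offset_rad"].items()})
```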

System Identification Priorities

Ground contact parameters (friction, stiffness, damping)
Knee and hip actuator dynamics (torque-speed curves, bandwidth)
Leg structural compliance (joint stiffness, link flexibility)
Arm actuator gains (lower priority given reasonable tracking)

8.3 Limitations

Task diversity: While 123 episodes represent diverse scenarios, they are not exhaustively labeled for task-specific analysis
Single robot: Results are specific to Booster T1; other humanoids may differ
Position control replay: While improved over kinematic replay, closed-loop policy execution would compound errors differently
No contact ground truth: Lack of force plates limits contact modeling analysis
MuJoCo-specific: Results may not generalize to other physics engines (IsaacGym, PyBullet, etc.)

Conclusion

9.1 Summary of Findings

Overall position tracking error of 5.56° MAE demonstrates moderate simulation fidelity for humanoid robots
Leg joints show ~1.2× higher error than arms (6.7° vs 5.5° MAE), with knee pitch reaching 12° error
Velocity tracking is excellent (0.41 rad/s MAE) with position control replay methodology
Correlation varies by body region: Arms show moderate agreement (r ≈ 0.40), legs show weak agreement (r ≈ 0.15)
Contact modeling appears to be the primary error source, as suggested by the concentration of error in load-bearing joints
Overall gap score of 0.36 indicates good simulation fidelity suitable for policy training with appropriate domain randomization

9.2 Practical Recommendations

For Practitioners

Use position control replay for gap characterization
Apply modest domain randomization (±8° legs, ±5° arms, ±1 rad/s velocity)
Consider joint-wise residual learning for knee/hip
Design controllers with ±10° margins for legs

Future Work

Closed-loop gap characterization
Task-stratified analysis
Cross-platform comparison
Learned residual dynamics

Appendix

A. Joint Specifications

Joint	Range (rad)	Range (deg)	Torque (Nm)
AAHead_yaw	[-1.57, 1.57]	[-90, 90]	7
Shoulder_Pitch	[-3.31, 1.22]	[-190, 70]	18
Elbow_Pitch	[-2.27, 2.27]	[-130, 130]	18
Waist	[-1.57, 1.57]	[-90, 90]	30
Hip_Pitch	[-1.80, 1.57]	[-103, 90]	45
Knee_Pitch	[0, 2.34]	[0, 134]	60
Ankle_Pitch	[-0.87, 0.35]	[-50, 20]	20

B. Dataset Statistics

Metric	Mean	Std	Min	Max
Duration (s)	30.0	16.5	5.7	93.8
Samples	2,315	1,147	127	19,660
Position RMSE (deg)	8.6	2.9	0.2	15.3
Correlation	0.26	0.20	-0.12	0.64

C. Reproducibility

research/sim2real/
├── scripts/
│   ├── data_loader.py      # Episode data loading
│   ├── booster_t1_sim.py   # MuJoCo simulation
│   ├── sim2real_analysis.py # Main analysis
│   └── generate_figures.py  # Figure generation
├── results/
│   ├── analysis_results.json
│   └── analysis_report.txt
└── figures/
    ├── fig1_trajectory_comparison.pdf
    ├── fig2_error_distribution.pdf
    └── ...

To reproduce: cd research/sim2real/scripts && python sim2real_analysis.py

D. Citation

@article{gerra2025sim2real,
  title={Empirical Characterization of the Simulation-to-Reality
         Gap in Full-Size Bipedal Humanoid Robots},
  author={gerra},
  year={2025},
  note={Available at https://gerra.com/research/sim2real}
}