Nirav Madhani
Nov 1, 2025

VLA-Adapter for NVIDIA GR00T Humanoid

Robotics / VLA / Transformers / Policy

Challenges Solved

  • BridgeAttention Architecture: Implemented a BridgeAttention module that maps multimodal inputs (vision, language, and proprioception) to a 43-D action space.
  • Policy Training: Trained a policy head on top of frozen SigLIP and Qwen2.5 backbones, achieving MSE 0.062 on offline action reconstruction.
  • Data Engineering: Addressed data scarcity by curating a specialized, augmented dataset (1.5k+ downloads) for robotic manipulation tasks.
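
The frozen-backbone training setup mentioned above can be illustrated with a minimal PyTorch sketch. This is not the project's training code: `backbone`, `policy_head`, `train_step`, the layer sizes, and the random tensors are hypothetical stand-ins chosen only to show the pattern, where the backbone receives no updates and only the head is optimized against an MSE loss on 43-D actions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins with made-up sizes; the real backbones are SigLIP and Qwen2.5.
backbone = nn.Linear(32, 16)
policy_head = nn.Linear(16, 43)  # 43-D action output, as stated in the text

# Freeze the backbone so it receives no gradient updates.
for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(policy_head.parameters(), lr=1e-3)


def train_step(obs, target_actions):
    """One optimization step on the head; returns the MSE loss as a float."""
    with torch.no_grad():  # no graph is built through the frozen backbone
        features = backbone(obs)
    pred = policy_head(features)
    loss = F.mse_loss(pred, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Dummy batch (random values, for illustration only).
obs = torch.randn(8, 32)
actions = torch.randn(8, 43)
backbone_before = backbone.weight.clone()
head_before = policy_head.weight.clone()
loss_value = train_step(obs, actions)
```

After one step the backbone weights are unchanged while the head weights have moved, which is the property that makes this kind of training cheap compared with full fine-tuning.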

Signal

Robotics / Vision-Language-Action Models / SOTA Implementation

Technical Depth

  • Trained policy head on frozen SigLIP/Qwen2.5 backbones (MSE 0.062).
  • 64 learnable query tokens.
  • Separate linear projections for vision, text, and state into a shared 512-D policy space.
  • Learnable modality gates (alpha_v, alpha_t, alpha_s) passed through sigmoid.
  • A 4-layer Transformer encoder with 8 attention heads.
  • Mean pooling over query tokens, followed by an MLP head to 43-D actions.
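
The components listed above can be wired together roughly as follows. This is a sketch under assumptions, not the project's implementation: the class name `BridgeAttentionSketch`, the input widths (`vision_dim`, `text_dim`, `state_dim`), the scalar form of the gates, the hidden width and activation of the MLP head, and the choice to concatenate the query tokens with the gated modality tokens before the encoder are all guesses made for illustration. Only the 64 queries, the 512-D shared space, the sigmoid gates, the 4-layer/8-head encoder, the mean pooling, and the 43-D output come from the list.

```python
import torch
import torch.nn as nn


class BridgeAttentionSketch(nn.Module):
    """Illustrative only: sizes follow the list above, the wiring is assumed."""

    def __init__(self, vision_dim=768, text_dim=896, state_dim=43,
                 d_model=512, num_queries=64, num_layers=4, num_heads=8,
                 action_dim=43):
        super().__init__()
        # Separate linear projections into the shared 512-D policy space.
        # Input widths are placeholders; they depend on the backbone variants.
        self.proj_v = nn.Linear(vision_dim, d_model)
        self.proj_t = nn.Linear(text_dim, d_model)
        self.proj_s = nn.Linear(state_dim, d_model)
        # Learnable modality gates (assumed scalar); sigmoid is applied in forward().
        self.alpha_v = nn.Parameter(torch.zeros(1))
        self.alpha_t = nn.Parameter(torch.zeros(1))
        self.alpha_s = nn.Parameter(torch.zeros(1))
        # 64 learnable query tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # 4-layer Transformer encoder with 8 attention heads.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # MLP head to 43-D actions (hidden width and activation are assumptions).
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim))

    def forward(self, vision_tokens, text_tokens, state):
        # vision_tokens: (B, Nv, vision_dim), text_tokens: (B, Nt, text_dim),
        # state: (B, state_dim)
        v = torch.sigmoid(self.alpha_v) * self.proj_v(vision_tokens)
        t = torch.sigmoid(self.alpha_t) * self.proj_t(text_tokens)
        s = torch.sigmoid(self.alpha_s) * self.proj_s(state).unsqueeze(1)
        q = self.queries.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        # Assumed fusion: queries attend to all modalities via self-attention.
        x = self.encoder(torch.cat([q, v, t, s], dim=1))
        # Mean pooling over the query-token positions only.
        pooled = x[:, : self.queries.size(0)].mean(dim=1)
        return self.head(pooled)


model = BridgeAttentionSketch().eval()
with torch.no_grad():
    actions = model(torch.randn(2, 16, 768),  # dummy vision tokens
                    torch.randn(2, 8, 896),   # dummy text tokens
                    torch.randn(2, 43))       # dummy proprioceptive state
```

With a batch of two dummy samples, `actions` has shape `(2, 43)`, one 43-D action vector per sample. Initializing the gate parameters at zero makes each sigmoid gate start at 0.5, so no modality dominates at the start of training; that initialization is also an assumption.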

Artifacts