Challenges Solved
- BridgeAttention Architecture: Implemented a BridgeAttention architecture that maps multimodal inputs (vision + language + proprioception) to a 43-D action space.
- Policy Training: Trained a policy head on top of frozen SigLIP and Qwen2.5 backbones, achieving an MSE of 0.062 on offline action reconstruction.
- Data Engineering: Addressed data scarcity by curating an augmented dataset (1.5k+ downloads) specialized for robotic manipulation tasks.
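The frozen-backbone training setup above can be sketched as a minimal PyTorch loop. This is an illustrative sketch, not the project's actual code: the tiny linear "backbones", feature dimensions, optimizer settings, and the `train_step` helper are all assumptions standing in for the real SigLIP/Qwen2.5 encoders.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the frozen SigLIP / Qwen2.5 encoders;
# in practice these would be loaded pretrained models.
vision_backbone = nn.Linear(64, 32)
text_backbone = nn.Linear(48, 32)
for p in list(vision_backbone.parameters()) + list(text_backbone.parameters()):
    p.requires_grad = False  # freeze backbones; only the policy head trains

# Small MLP policy head mapping fused features to 43-D actions.
policy_head = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 43))

opt = torch.optim.AdamW(policy_head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(img_feats, txt_feats, target_actions):
    """One offline action-reconstruction step (MSE against logged actions)."""
    with torch.no_grad():  # backbones stay frozen
        v = vision_backbone(img_feats)
        t = text_backbone(txt_feats)
    pred = policy_head(torch.cat([v, t], dim=-1))
    loss = loss_fn(pred, target_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Freezing the backbones keeps the trainable parameter count small, so the head can be fit on a modest offline demonstration dataset.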
Signal
Robotics / Vision-Language-Action Models / SOTA Implementation
Technical Depth
- Trained a policy head on frozen SigLIP/Qwen2.5 backbones (MSE 0.062).
- 64 learnable query tokens.
- Separate linear projections for vision, text, and state into a shared 512-D policy space.
- Learnable modality gates (alpha_v, alpha_t, alpha_s) passed through a sigmoid.
- A 4-layer Transformer encoder with 8 attention heads.
- Mean pooling over query tokens, followed by an MLP head to 43-D actions.
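The components listed above can be assembled into a single module. The sketch below is an assumed reconstruction from these notes, not the project's source: the backbone feature dimensions (`d_vision`, `d_text`), the MLP head shape, and the choice to gate each modality with a learnable scalar are all hypothetical details filled in for illustration.

```python
import torch
import torch.nn as nn

class BridgeAttentionPolicyHead(nn.Module):
    """Sketch of the described policy head: per-modality projections into a
    shared 512-D space, sigmoid-gated modality tokens, 64 learnable queries,
    a 4-layer/8-head Transformer encoder, and an MLP to 43-D actions."""

    def __init__(self, d_vision=1152, d_text=3584, d_state=43, d_model=512,
                 n_queries=64, n_layers=4, n_heads=8, action_dim=43):
        super().__init__()
        # Separate linear projections into the shared policy space.
        self.proj_v = nn.Linear(d_vision, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_s = nn.Linear(d_state, d_model)
        # Learnable modality gates alpha_v, alpha_t, alpha_s (sigmoid-squashed).
        self.alpha_v = nn.Parameter(torch.zeros(1))
        self.alpha_t = nn.Parameter(torch.zeros(1))
        self.alpha_s = nn.Parameter(torch.zeros(1))
        # 64 learnable query tokens.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.n_queries = n_queries
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, action_dim))

    def forward(self, vis_tokens, txt_tokens, state):
        # vis_tokens: (B, Nv, d_vision); txt_tokens: (B, Nt, d_text);
        # state: (B, d_state), projected to a single token.
        v = torch.sigmoid(self.alpha_v) * self.proj_v(vis_tokens)
        t = torch.sigmoid(self.alpha_t) * self.proj_t(txt_tokens)
        s = torch.sigmoid(self.alpha_s) * self.proj_s(state).unsqueeze(1)
        q = self.queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([q, v, t, s], dim=1))
        pooled = x[:, :self.n_queries].mean(dim=1)  # mean pool over queries
        return self.head(pooled)
```

The queries attend to all gated modality tokens inside the encoder, so the mean-pooled query slots act as a fixed-size bottleneck regardless of how many vision or text tokens the backbones emit.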
Artifacts