Challenges Solved
- BridgeAttention Architecture: Implemented a BridgeAttention architecture that maps multimodal inputs (vision + language + proprioception) to a 43-D action space.
- Policy Training: Trained a policy head on top of frozen SigLIP and Qwen2.5 backbones, achieving an MSE of 0.062 on offline action reconstruction.
- Data Engineering: Addressed data scarcity by curating an augmented dataset (1.5k+ downloads) specialized for robotic manipulation tasks.
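The frozen-backbone training setup above can be sketched as a minimal PyTorch loop. This is an illustrative sketch, not the project's actual code: the tiny linear "backbones", feature dimensions, optimizer settings, and the `train_step` helper are all assumptions standing in for the real SigLIP/Qwen2.5 encoders.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the frozen SigLIP / Qwen2.5 encoders;
# in practice these would be loaded pretrained models.
vision_backbone = nn.Linear(64, 32)
text_backbone = nn.Linear(48, 32)
for p in list(vision_backbone.parameters()) + list(text_backbone.parameters()):
    p.requires_grad = False  # freeze backbones; only the policy head trains

# Small MLP policy head mapping fused features to 43-D actions.
policy_head = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 43))

opt = torch.optim.AdamW(policy_head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(img_feats, txt_feats, target_actions):
    """One offline action-reconstruction step (MSE against logged actions)."""
    with torch.no_grad():  # backbones stay frozen
        v = vision_backbone(img_feats)
        t = text_backbone(txt_feats)
    pred = policy_head(torch.cat([v, t], dim=-1))
    loss = loss_fn(pred, target_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Freezing the backbones keeps the trainable parameter count small, so the head can be fit on a modest offline demonstration dataset.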
Signal
Robotics / Vision-Language-Action Models / SOTA Implementation
Technical Depth
- Trained a policy head on frozen SigLIP/Qwen2.5 backbones (MSE 0.062).
- 64 learnable query tokens.
- Separate linear projections for vision, text, and state into a shared 512-D policy space.
- Learnable modality gates (alpha_v, alpha_t, alpha_s) passed through a sigmoid.
- A 4-layer Transformer encoder with 8 attention heads.
- Mean pooling over query tokens, followed by an MLP head to 43-D actions.
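The components listed above can be assembled into a single module. The sketch below is an assumed reconstruction from these notes, not the project's source: the backbone feature dimensions (`d_vision`, `d_text`), the MLP head shape, and the choice to gate each modality with a learnable scalar are all hypothetical details filled in for illustration.

```python
import torch
import torch.nn as nn

class BridgeAttentionPolicyHead(nn.Module):
    """Sketch of the described policy head: per-modality projections into a
    shared 512-D space, sigmoid-gated modality tokens, 64 learnable queries,
    a 4-layer/8-head Transformer encoder, and an MLP to 43-D actions."""

    def __init__(self, d_vision=1152, d_text=3584, d_state=43, d_model=512,
                 n_queries=64, n_layers=4, n_heads=8, action_dim=43):
        super().__init__()
        # Separate linear projections into the shared policy space.
        self.proj_v = nn.Linear(d_vision, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_s = nn.Linear(d_state, d_model)
        # Learnable modality gates alpha_v, alpha_t, alpha_s (sigmoid-squashed).
        self.alpha_v = nn.Parameter(torch.zeros(1))
        self.alpha_t = nn.Parameter(torch.zeros(1))
        self.alpha_s = nn.Parameter(torch.zeros(1))
        # 64 learnable query tokens.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.n_queries = n_queries
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, action_dim))

    def forward(self, vis_tokens, txt_tokens, state):
        # vis_tokens: (B, Nv, d_vision); txt_tokens: (B, Nt, d_text);
        # state: (B, d_state), projected to a single token.
        v = torch.sigmoid(self.alpha_v) * self.proj_v(vis_tokens)
        t = torch.sigmoid(self.alpha_t) * self.proj_t(txt_tokens)
        s = torch.sigmoid(self.alpha_s) * self.proj_s(state).unsqueeze(1)
        q = self.queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([q, v, t, s], dim=1))
        pooled = x[:, :self.n_queries].mean(dim=1)  # mean pool over queries
        return self.head(pooled)
```

The queries attend to all gated modality tokens inside the encoder, so the mean-pooled query slots act as a fixed-size bottleneck regardless of how many vision or text tokens the backbones emit.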
Artifacts