Challenges Solved
- GRPO Implementation: Implemented Group Relative Policy Optimization (GRPO) to fine-tune Gemma-3-270M for chain-of-thought reasoning.
- Reward Engineering: Engineered custom reward functions combining sympy-based correctness checks with reasoning efficiency penalties.
- VRAM Optimization: Fit training within a 15 GB VRAM budget using gradient accumulation, BF16 mixed precision, and memory-efficient attention settings.
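The reward and GRPO pieces above can be sketched together: a sympy-based correctness check, a length penalty for reasoning efficiency, and GRPO's group-relative advantage (reward normalized against the other completions sampled for the same prompt). Function names, the penalty coefficient, and the word-count budget are illustrative assumptions, not the project's exact code.

```python
import statistics

import sympy


def correctness_reward(model_answer: str, reference: str) -> float:
    """1.0 if the expressions are symbolically equal, else 0.0.

    Hypothetical sketch; real model output needs messier parsing.
    """
    try:
        diff = sympy.simplify(sympy.sympify(model_answer) - sympy.sympify(reference))
        return 1.0 if diff == 0 else 0.0
    except (sympy.SympifyError, TypeError):
        return 0.0


def length_penalty(completion: str, budget: int = 256) -> float:
    """Small penalty once a completion overruns a word budget (assumed values)."""
    overshoot = max(0, len(completion.split()) - budget)
    return -0.001 * overshoot


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

In use, one prompt yields a group of sampled completions; each completion's reward is `correctness_reward(...) + length_penalty(...)`, and the resulting list is passed through `group_relative_advantages` so correct-and-concise completions get positive advantage without any learned value model.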
Signal
RLHF / Post-Training / Foundation Models
Technical Depth
- Training data comes from formatted math prompts with explicit {thinking} and {answer} delimiters.
- Batch size 16 with gradient accumulation 4 (effective batch 64).
- 2 GRPO epochs.
- LR 5e-6 with cosine scheduling and 3% warmup.
- BF16 or FP16 with eager attention and use_cache=False.
- Checkpointing every 500 steps; logs reward mean/std, KL, loss, and output length.
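The schedule above (LR 5e-6, cosine decay, 3% warmup) can be sketched as a step-to-LR function. The linear warmup shape and decay to zero are assumptions; only the peak LR and warmup fraction come from the config.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-6,
              warmup_frac: float = 0.03) -> float:
    """Linear warmup over the first 3% of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from ~0 up to base_lr at the end of warmup.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr toward zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For a 1000-step run this peaks at 5e-6 around step 30 and decays smoothly afterward, matching the shape (though not necessarily the exact library implementation) of a standard cosine-with-warmup scheduler.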
Links