Nirav Madhani
Dec 1, 2025

RL for Mathematical Reasoning (Gemma-3 Fine-tuning)

ML/DL · Reasoning · GRPO · Training

Challenges Solved

  • GRPO Implementation: Implemented Group Relative Policy Optimization (GRPO) to fine-tune Gemma-3-270M for chain-of-thought reasoning.
  • Reward Engineering: Engineered custom reward functions combining sympy-based correctness checks with reasoning efficiency penalties.
  • VRAM Optimization: Optimized training for 15GB VRAM constraints using gradient accumulation, BF16 mixed-precision, and efficient attention mechanisms.
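The reward shaping described above can be sketched as follows. This is a minimal stand-in: it uses the stdlib `fractions` module for numeric equivalence where the project uses sympy, and the delimiter syntax, penalty weight, and length budget are all assumptions, not the project's actual values.

```python
import re
from fractions import Fraction

def numeric_equal(a: str, b: str) -> bool:
    """Numeric equivalence check; the project uses sympy, Fraction stands in here."""
    try:
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return a.strip() == b.strip()

def reward(completion: str, gold: str, max_words: int = 512, penalty: float = 0.2) -> float:
    """Correctness reward minus an efficiency penalty on overly long reasoning.

    Delimiter format and weights are illustrative assumptions.
    """
    m = re.search(r"\{answer\}(.*)", completion, re.DOTALL)
    if m is None:
        return -1.0  # malformed output: missing the {answer} delimiter
    answer = m.group(1).strip()
    correct = 1.0 if numeric_equal(answer, gold) else 0.0
    # Penalize only the portion of the completion that exceeds the word budget.
    overflow = max(0, len(completion.split()) - max_words) / max_words
    return correct - penalty * overflow
```

Combining a hard correctness signal with a soft length penalty like this rewards short, correct chains of thought without zeroing out partially efficient but wrong attempts asymmetrically.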

Signal

RLHF / Post-Training / Foundation Models

Technical Depth

  • Training data comes from formatted math prompts with explicit {thinking} and {answer} delimiters.
  • Batch size 16 with gradient accumulation 4 (effective batch 64).
  • 2 GRPO epochs.
  • LR 5e-6 with cosine scheduling and 3% warmup.
  • BF16 or FP16 with eager attention and use_cache=False.
  • Checkpointing every 500 steps; logs reward mean/std, KL, loss, and output length.
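The hyperparameters above can be collected into a single config sketch; the field names below are illustrative, not the project's actual code, but the values mirror the listed run settings.

```python
from dataclasses import dataclass

@dataclass
class GRPOTrainingConfig:
    """Hyperparameters from the run described above; field names are illustrative."""
    per_device_batch_size: int = 16
    gradient_accumulation_steps: int = 4   # effective batch = 16 * 4 = 64
    num_epochs: int = 2
    learning_rate: float = 5e-6
    lr_scheduler: str = "cosine"
    warmup_ratio: float = 0.03             # 3% warmup
    dtype: str = "bfloat16"                # fp16 fallback where BF16 is unsupported
    attn_implementation: str = "eager"
    use_cache: bool = False                # KV cache disabled during training
    save_steps: int = 500                  # checkpoint every 500 steps

    @property
    def effective_batch_size(self) -> int:
        return self.per_device_batch_size * self.gradient_accumulation_steps
```

Gradient accumulation is what makes the effective batch of 64 fit within the 15GB VRAM budget: only 16 samples are resident per forward pass, with gradients summed across 4 passes before each optimizer step.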

Links