Challenges Solved
- GRPO Implementation: Implemented Group Relative Policy Optimization (GRPO) to fine-tune Gemma-3-270M for chain-of-thought reasoning.
- Reward Engineering: Engineered custom reward functions combining sympy-based correctness checks with reasoning efficiency penalties.
- VRAM Optimization: Fit training within a 15 GB VRAM budget using gradient accumulation, BF16 mixed precision, and memory-efficient attention settings.
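The reward and GRPO pieces above can be sketched together: a sympy-based correctness check, a length penalty for reasoning efficiency, and GRPO's group-relative advantage (reward normalized against the other completions sampled for the same prompt). Function names, the penalty coefficient, and the word-count budget are illustrative assumptions, not the project's exact code.

```python
import statistics

import sympy


def correctness_reward(model_answer: str, reference: str) -> float:
    """1.0 if the expressions are symbolically equal, else 0.0.

    Hypothetical sketch; real model output needs messier parsing.
    """
    try:
        diff = sympy.simplify(sympy.sympify(model_answer) - sympy.sympify(reference))
        return 1.0 if diff == 0 else 0.0
    except (sympy.SympifyError, TypeError):
        return 0.0


def length_penalty(completion: str, budget: int = 256) -> float:
    """Small penalty once a completion overruns a word budget (assumed values)."""
    overshoot = max(0, len(completion.split()) - budget)
    return -0.001 * overshoot


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

In use, one prompt yields a group of sampled completions; each completion's reward is `correctness_reward(...) + length_penalty(...)`, and the resulting list is passed through `group_relative_advantages` so correct-and-concise completions get positive advantage without any learned value model.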
Signal
RLHF / Post-Training / Foundation Models
Technical Depth
- Training data comes from formatted math prompts with explicit {thinking} and {answer} delimiters.
- Batch size 16 with gradient accumulation 4 (effective batch 64).
- 2 GRPO epochs.
- LR 5e-6 with cosine scheduling and 3% warmup.
- BF16 or FP16 with eager attention and use_cache=False.
- Checkpointing every 500 steps; logs reward mean/std, KL, loss, and output length.
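The schedule above (LR 5e-6, cosine decay, 3% warmup) can be sketched as a step-to-LR function. The linear warmup shape and decay to zero are assumptions; only the peak LR and warmup fraction come from the config.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-6,
              warmup_frac: float = 0.03) -> float:
    """Linear warmup over the first 3% of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from ~0 up to base_lr at the end of warmup.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr toward zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For a 1000-step run this peaks at 5e-6 around step 30 and decays smoothly afterward, matching the shape (though not necessarily the exact library implementation) of a standard cosine-with-warmup scheduler.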
Links