January 30, 2025
Skip the rest if you must, but try to hold onto these three points. The RL wave is turning your evals into the dataset, and your investment in them will likely outlive any piece of code you write, any model you train, and any product feature you ship.
Read on for a slightly deeper story: R1-Zero, a quick look at GRPO, and a practical run-down of what this means for evaluations.
So yes, there’s been a ton of noise around R1 — the pretraining efficiency of the base V3 model, the impressive / dizzying list of optimizations the DeepSeek team introduced, the product implications of its thinking tokens, speculation about distillation, etc.
But the quieter, arguably more game-changing element of the release is R1-Zero: a simpler model produced along the way to R1 with pure-RL post-training. R1-Zero's code-verification reinforcement loop is shockingly simple, and the fact that it works at all is just as striking.
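To make "shockingly simple" concrete, here is a minimal sketch (in Python) of what a rule-based verification reward can look like: sample a completion, check its format and its final answer against a known reference, and hand the resulting scalar back to the RL update. The tag conventions, reward values, and function name below are illustrative assumptions, not DeepSeek's actual code.

```python
import re

def verification_reward(completion: str, reference_answer: str) -> float:
    """Rule-based reward for one sampled completion: pure checks, no learned reward model.

    Illustrative sketch only -- the tag format, answer format, and reward
    values here are assumptions, not DeepSeek's implementation.
    """
    reward = 0.0

    # Format check: did the model wrap its reasoning in <think> ... </think> tags?
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1

    # Accuracy check: pull out the final \boxed{...} answer and compare to the reference.
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward


# Score a group of sampled completions for one prompt; these scalar rewards
# are what the policy-gradient update consumes.
completions = [
    "<think>2 + 2 = 4</think> The answer is \\boxed{4}.",
    "The answer is \\boxed{5}.",
]
print([verification_reward(c, "4") for c in completions])  # [1.1, 0.0]
```

The point of the sketch is the design choice: the reward comes from a verifier you can run, not from a learned reward model, which is what makes the loop so cheap to scale.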
At its heart is an even less-talked-about innovation, also from the DeepSeek team: the GRPO objective function.
PPO is one of the most common policy-gradient methods; its objective function, in a very simplified form, looks sort of like this:
$$ L(\theta) = \mathbb{E}_o\left[\,\text{clipping}\big(r_t(o;\theta)\,A_t(o)\big)\right] $$
where: