January 30, 2025


TLDR

  1. As of today, RL-based post-training appears to be feasible and affordable, and it maps well to code-based tasks (i.e. agentic workflows!). If data was once the new gold, it is starting to look like evals might be the new platinum.

  2. **Vinith was wrong. Your evals actually do need to be perfect.** While it’s true that a corpus of simple, uncorrelated, necessary-but-insufficient “spot checks” is great for a held-out evaluation score, evals are now becoming training signals. RL is a ruthlessly efficient paperclip maximizer, and will take full advantage of any gameability in the reward signal (see the sketch just after this list). Write fewer evals, and make them bulletproof.

  3. Evaluate with production data. No matter how thorough your offline tests, real-world usage (e.g., your user actually trying to buy orange soda at Safeway) is a limitless supply of evaluation scenarios, surfaces the unexpected-but-important, and guarantees ongoing alignment between your models and your users.
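
To make the gameability point concrete, here is a minimal sketch in Python, tied to the grocery example above. All of the names here are illustrative and hypothetical, not anyone's production harness: it contrasts a loose eval that grades the agent's transcript with a strict eval that grades the observable end state.

```python
from dataclasses import dataclass, field


@dataclass
class CartState:
    items: list[str] = field(default_factory=list)  # product IDs in the final cart


def loose_eval(transcript: str) -> float:
    # Gameable: an RL-trained policy can learn to *talk about* orange soda
    # without ever adding it to the cart.
    return 1.0 if "orange soda" in transcript.lower() else 0.0


def strict_eval(final_cart: CartState, expected_item_id: str = "orange-soda-12oz") -> float:
    # Harder to game: grade the end state the user actually cares about,
    # not the narration along the way.
    return 1.0 if expected_item_id in final_cart.items else 0.0
```

Used as a held-out spot check, `loose_eval` is harmless; used as a reward signal, it is an open invitation for reward hacking, which is the whole point of item 2.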

Skip the rest if you must, but try to hold onto these three points. The RL wave is turning your evals into the dataset, and your investment in them will likely outlive any piece of code you write, any model you train, and any product feature you ship.

Read on for a slightly deeper story: R1-Zero, a quick look at GRPO, and a practical rundown of what this means for evaluations.


Background

1. R1 & R1-zero: beyond r/wallstreetbets

So yes, there’s been a ton of noise around R1 — the pretraining efficiency of the base V3 model, the impressive / dizzying list of optimizations the DeepSeek team introduced, the product implications of its thinking tokens, speculation about distillation, etc.

But the quieter, arguably more game-changing element of the release is R1-Zero: a simpler model, produced along the way to R1, that was post-trained with pure RL. Its code-verification reinforcement loop is shockingly simple, and the fact that it works at all is more shocking still.
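
To give a sense of just how little machinery is involved, here is a minimal sketch of what such a rule-based reward can look like. This is a reconstruction under assumptions rather than DeepSeek's actual code: the `<think>` tag convention is borrowed from the R1 release, while the 0.5 format weight and the execution harness are illustrative.

```python
import re
import subprocess
import sys
import tempfile


def format_reward(completion: str) -> float:
    # Reward the model for putting its reasoning inside <think> ... </think> tags.
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0


def code_reward(completion: str, test_code: str) -> float:
    # Verifiable by construction: take whatever follows the reasoning block,
    # append the held-out unit tests, and run it. Pass/fail is the whole signal.
    candidate = completion.split("</think>")[-1]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


def reward(completion: str, test_code: str) -> float:
    # The entire "reward model" is these two rule-based checks.
    return format_reward(completion) + code_reward(completion, test_code)
```

Per-completion scores like these are the only supervision the loop sees; everything else comes down to the objective that turns them into gradients.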

At its heart is an even less-talked-about innovation, also from the DeepSeek team: the GRPO objective function.

Enter GRPO

The PPO objective function — one of the most common policy gradient methods — looks, in a very simplified form, sort of like this:

$$ L(\theta) = \mathbb{E}_o\!\left[\,\text{clipping}\big(r_t(o;\theta)\,A_t(o)\big)\right] $$

where: