April 9, 2024

Meta-TLDR: It’s both feasible and advisable to be eval-driven in LLM application development.

TLDR: Five lessons from the trenches (in no particular order)


Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

The tree being chopped is an LLM application, and the axe sharpening is one’s evaluation suite.

Developers have long extolled the virtues of software testing, much as experienced ML practitioners fixate on evals, but the two practices diverge in a few ways:

| Software testing | ML evaluations |
| --- | --- |
| Automated | Sometimes manual |
| Fast | Slow |
| Binary pass/fail | Grayscale metric |
| Test-driven development | SOTA progression |
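
To make the contrast concrete, here's a minimal sketch; the helper names (slugify, summarize, score, the eval set) are illustrative stand-ins, not any particular library.

```python
def slugify(title: str) -> str:
    # Tiny piece of ordinary software under test.
    return "-".join(
        "".join(c if c.isalnum() else " " for c in title.lower()).split()
    )

def test_slugify():
    # Software test: automated, fast, binary pass/fail.
    assert slugify("Hello, World!") == "hello-world"

def eval_summaries(summarize, score, eval_set) -> float:
    # ML eval: often involves slow model calls and returns a grayscale
    # metric (a mean score over a dataset), not a single pass/fail bit.
    scores = [score(summarize(ex["document"]), ex["reference"]) for ex in eval_set]
    return sum(scores) / len(scores)
```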

Without LLMs, the interface between software and ML tends to be clean. If you're building a mobile app that recognizes hot dogs, you can neatly separate the visual classifier (the ML component) from the rest of your system and validate each piece on its own: tests for the software, evals for the ML.
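
Roughly what that clean seam looks like in code; the class, method, and confidence threshold here are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    is_hot_dog: bool
    confidence: float

class HotDogClassifier:
    def predict(self, image_bytes: bytes) -> Prediction:
        ...  # the ML component; it gets its own eval on a labeled image set

def label_photo(image_bytes: bytes, classifier: HotDogClassifier) -> str:
    # Plain software around the model: unit-testable by injecting a fake classifier.
    pred = classifier.predict(image_bytes)
    return "Hot dog" if pred.is_hot_dog and pred.confidence >= 0.8 else "Not hot dog"
```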


In contrast, LLM-powered applications tend to sprinkle LLM invocations into control logic throughout the codebase. The number of such invocations usually grows as the system becomes more sophisticated, and as LLM inference costs continue to fall, we'll likely see this pattern appear in more and more applications.
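
A sketch of what that scattering can look like; llm() and escalate() are hypothetical helpers, and the ticket-routing logic is invented purely for illustration.

```python
def llm(prompt: str) -> str:
    ...  # stand-in for whatever model client the application actually uses

def escalate(ticket_text: str) -> str:
    ...  # stand-in for a human hand-off path

def handle_ticket(ticket_text: str) -> str:
    # Every branch below is gated on model output, so there is no single seam
    # where the ML behavior can be isolated and evaluated on its own.
    category = llm(f"Classify this support ticket as 'billing' or 'bug': {ticket_text}")
    if "billing" in category.lower():
        return llm(f"Draft a polite billing reply to: {ticket_text}")
    severity = llm(f"Rate this bug report's severity from 1 to 5: {ticket_text}")
    if severity.strip().startswith(("4", "5")):
        return escalate(ticket_text)
    return llm(f"Draft a troubleshooting reply to: {ticket_text}")
```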


What kind of validation workflow makes sense in this kind of world?

The cherry on top: we often need to evaluate generative outputs, which has been a challenge for the ML community even in isolated, well-controlled settings.
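
For a flavor of why, here's one crude stopgap (a sketch, not a recommendation): scoring a generation against a reference by token overlap. It produces a grayscale metric, but two equally good answers worded differently can score very differently.

```python
def token_f1(generated: str, reference: str) -> float:
    # Overlap-based F1 between generated and reference token sets.
    gen, ref = set(generated.lower().split()), set(reference.lower().split())
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```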

Lesson 1: Spot checks are your friend