April 9, 2024

Meta-TLDR: It’s both feasible and advisable to be eval-driven in LLM application development.

TLDR: Five lessons from the trenches (in no particular order)


Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

The tree being chopped is an LLM application, and the axe sharpening is one’s evaluation suite.

Developers have long extolled the virtues of software testing, much as experienced ML practitioners fixate on evals, but the two practices diverge in a few ways:

| Software testing | ML evaluations |
| --- | --- |
| Automated | Sometimes manual |
| Fast | Slow |
| Binary pass/fail | Grayscale metric |
| Test-driven development | SOTA progression |
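
To make the contrast concrete, here's a minimal sketch; the helper names (slugify, summarize, score, the eval set) are illustrative stand-ins, not any particular library.

```python
def slugify(title: str) -> str:
    # Tiny piece of ordinary software under test.
    return "-".join(
        "".join(c if c.isalnum() else " " for c in title.lower()).split()
    )

def test_slugify():
    # Software test: automated, fast, binary pass/fail.
    assert slugify("Hello, World!") == "hello-world"

def eval_summaries(summarize, score, eval_set) -> float:
    # ML eval: often involves slow model calls and returns a grayscale
    # metric (a mean score over a dataset), not a single pass/fail bit.
    scores = [score(summarize(ex["document"]), ex["reference"]) for ex in eval_set]
    return sum(scores) / len(scores)
```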

Without LLMs, the interface between software and ML tends to be clean. If you're building a mobile app that recognizes hot dogs, you can neatly separate the visual classifier (the ML component) from the rest of your system and validate each piece on its own: tests for the software, evals for the ML.
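
Roughly what that clean seam looks like in code; the class, method, and confidence threshold here are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    is_hot_dog: bool
    confidence: float

class HotDogClassifier:
    def predict(self, image_bytes: bytes) -> Prediction:
        ...  # the ML component; it gets its own eval on a labeled image set

def label_photo(image_bytes: bytes, classifier: HotDogClassifier) -> str:
    # Plain software around the model: unit-testable by injecting a fake classifier.
    pred = classifier.predict(image_bytes)
    return "Hot dog" if pred.is_hot_dog and pred.confidence >= 0.8 else "Not hot dog"
```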


In contrast, LLM-powered applications tend to sprinkle LLM invocations into control logic throughout the codebase. The number of such invocations usually grows as the system becomes more sophisticated, and as LLM inference costs continue to fall, we'll likely see this pattern appear in more and more applications.
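
A sketch of what that scattering can look like; llm() and escalate() are hypothetical helpers, and the ticket-routing logic is invented purely for illustration.

```python
def llm(prompt: str) -> str:
    ...  # stand-in for whatever model client the application actually uses

def escalate(ticket_text: str) -> str:
    ...  # stand-in for a human hand-off path

def handle_ticket(ticket_text: str) -> str:
    # Every branch below is gated on model output, so there is no single seam
    # where the ML behavior can be isolated and evaluated on its own.
    category = llm(f"Classify this support ticket as 'billing' or 'bug': {ticket_text}")
    if "billing" in category.lower():
        return llm(f"Draft a polite billing reply to: {ticket_text}")
    severity = llm(f"Rate this bug report's severity from 1 to 5: {ticket_text}")
    if severity.strip().startswith(("4", "5")):
        return escalate(ticket_text)
    return llm(f"Draft a troubleshooting reply to: {ticket_text}")
```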


What kind of validation workflow makes sense in this kind of world?

The cherry on top: we often need to evaluate generative outputs, which has been a challenge for the ML community even in isolated, well-controlled settings.
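
For a flavor of why, here's one crude stopgap (a sketch, not a recommendation): scoring a generation against a reference by token overlap. It produces a grayscale metric, but two equally good answers worded differently can score very differently.

```python
def token_f1(generated: str, reference: str) -> float:
    # Overlap-based F1 between generated and reference token sets.
    gen, ref = set(generated.lower().split()), set(reference.lower().split())
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```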

Lesson 1: Spot checks are your friend