April 9, 2024
Meta-TLDR: It’s both feasible and advisable to be eval-driven in LLM application development.
Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
The tree being chopped is an LLM application, and the axe sharpening is one’s evaluation suite.
Devs have long extolled the virtues of software testing, much as experienced ML practitioners fixate on evals, but the two practices diverge in a few important ways:
| Software testing | ML evaluations |
|---|---|
| Automated | Sometimes manual |
| Fast | Slow |
| Binary pass/fail | Grayscale metric |
| Test-driven development | SOTA progression |
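To make the contrast concrete, here is a minimal sketch with hypothetical stand-in functions (nothing here comes from a real codebase): the software test asserts a binary outcome, while the eval runs a model over a labeled set and aggregates a grayscale score.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins to make the contrast concrete.
def parse_order(text: str) -> dict:
    """Deterministic software component: parse '2x hot dogs' into a dict."""
    qty, item = text.split("x ", 1)
    return {"item": item.rstrip("s"), "qty": int(qty)}

@dataclass
class Example:
    input: str
    reference: str

def test_parse_order() -> None:
    # Software test: fast, deterministic, binary pass/fail.
    assert parse_order("2x hot dogs") == {"item": "hot dog", "qty": 2}

def run_eval(examples: List[Example],
             generate: Callable[[str], str],
             score: Callable[[str, str], float]) -> float:
    # ML eval: run the model over a labeled set and aggregate a grayscale
    # metric (here, a mean score) rather than returning a single boolean.
    scores = [score(generate(ex.input), ex.reference) for ex in examples]
    return sum(scores) / len(scores)
```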
Without LLMs, the interface between software and ML tends to be clean. If you’re building a mobile app that recognizes hot dogs, you can neatly separate the visual-classifier ML component from the rest of your system and validate each piece on its own terms: tests for the software, evals for the model.
In contrast, LLM-powered applications often end up sprinkling LLM invocations throughout their control logic. The number of such invocations tends to grow as the system becomes more sophisticated, and as LLM inference costs continue to fall, this pattern will likely show up in more and more applications.
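Here is a sketch of what that pattern looks like in practice; `llm` is an assumed text-in/text-out callable (a thin wrapper around whichever provider you use), not a specific library API.

```python
from typing import Callable

def handle_ticket(ticket: str, llm: Callable[[str], str]) -> str:
    """Toy support-ticket handler with LLM calls woven into control flow."""
    # LLM call #1: a routing decision inside ordinary if/else logic.
    route = llm(
        f"Classify this support ticket as 'billing' or 'technical':\n{ticket}"
    ).strip().lower()

    if "billing" in route:
        # LLM call #2: an extraction whose output feeds downstream logic.
        account_id = llm(f"Extract the account ID from this ticket:\n{ticket}").strip()
        return f"Escalated billing issue for account {account_id}"

    # LLM call #3: a generative reply returned directly to the user.
    return llm(f"Draft a short troubleshooting reply for this ticket:\n{ticket}")
```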
What kind of validation workflow makes sense in such a world?
The cherry on top: we often need to evaluate generative outputs, something that has challenged the ML community even in isolated settings.
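One common workaround, shown only as a sketch rather than a recommendation: have a second LLM act as a judge against a rubric, which at least yields a grayscale number that can slot into the kind of eval suite described above. `llm` is again an assumed text-in/text-out callable.

```python
from typing import Callable

JUDGE_PROMPT = """\
Rate the response from 1 to 5 for faithfulness to the reference.
Reference: {reference}
Response: {response}
Reply with only the number."""

def judge_score(response: str, reference: str, llm: Callable[[str], str]) -> float:
    """Return a grayscale score in [0, 1], or 0.0 if the judge's reply is unusable."""
    raw = llm(JUDGE_PROMPT.format(reference=reference, response=response))
    try:
        return (float(raw.strip()) - 1) / 4  # map the 1-5 rating onto 0-1
    except ValueError:
        return 0.0
```

Normalizing to [0, 1] keeps scores from different scorers comparable when they are averaged into a single suite-level metric.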