Nguyen Le Phong

Evaluating LLM Performance

A practical guide to evaluating LLM performance with task-specific datasets, rubrics, human review, regression checks, latency, cost, safety, and production feedback instead of relying on impressive demos.

The demo looked good on the big screen. A prompt went in, a polished answer came out, and for a few minutes the room felt lighter. Then someone opened the spreadsheet of real support tickets, the ones with incomplete context, strange wording, angry customers, missing attachments, and product names that sounded almost the same. The model still looked useful, but the question changed from “is this impressive?” to “can we trust this often enough?”

Evaluating LLM performance begins when we stop treating the model as a magic box and start treating it as part of a product system. A language model can sound confident while being wrong. It can be right on easy examples and fragile on the cases that matter. It can improve one metric while making latency, cost, privacy, or user trust worse. A good evaluation does not remove uncertainty, but it makes the uncertainty visible.

The first step is to define the task in ordinary product language. Are we summarizing long documents, extracting fields, answering from a knowledge base, drafting replies, classifying risk, generating code, or helping a user search? “Good answer” means different things in each case. A concise summary may be valuable for one workflow and dangerous for another if it omits a required detail. Evaluation has to belong to the job, not to the model in general.

A useful dataset should include more than clean examples. It needs common cases, edge cases, historical failures, ambiguous inputs, outdated information, adversarial phrasing, and examples where the correct behavior is to say “I do not know.” If the eval set only contains happy paths, it will reward a model for performing well in a world the product does not actually live in. The messy tickets, old docs, and awkward user language are not noise. They are the test.
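One way to keep the eval set honest is to tag every example with the kind of case it represents and check which kinds are still missing. The sketch below is only an illustration: the category names, the `EvalExample` fields, and the sample tickets are assumptions made up for this example, not a standard schema.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical category names, mirroring the kinds of cases listed above.
REQUIRED_CATEGORIES = {
    "common",
    "edge_case",
    "historical_failure",
    "ambiguous",
    "outdated_info",
    "adversarial",
    "should_abstain",
}


@dataclass(frozen=True)
class EvalExample:
    example_id: str
    category: str
    prompt: str
    expected_behavior: str


def missing_categories(examples):
    """Return the required categories that have no example yet, sorted."""
    present = Counter(ex.category for ex in examples)
    return sorted(c for c in REQUIRED_CATEGORIES if present[c] == 0)


# Made-up sample data: one happy path and one case where abstaining is correct.
examples = [
    EvalExample("t-001", "common", "How do I reset my password?", "answer"),
    EvalExample(
        "t-002",
        "should_abstain",
        "What will the unreleased plan cost?",
        "say it does not know",
    ),
]

print(missing_categories(examples))
```

A check like this says nothing about whether the examples are good. It only makes it obvious when the set has drifted toward happy paths.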

Rubrics help turn taste into judgment. For a support answer, we might score factual accuracy, completeness, tone, citation quality, policy compliance, and whether the answer asks for missing information instead of inventing it. For code generation, we might check correctness, security, readability, test coverage, and fit with the existing codebase. The rubric does not need to be complicated at first. It needs to be explicit enough that two reviewers can discuss disagreement without only saying “I like this one better.”
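A rubric can start as a fixed list of criteria and a shared scale. The sketch below assumes a 1 to 5 scale and two reviewers, and flags the criteria where their scores are far enough apart to deserve a conversation. The criterion names come from the support example above; the scale, the threshold, and the function names are assumptions.

```python
# Criteria taken from the support-answer example; the 1-5 scale is an assumption.
CRITERIA = (
    "factual_accuracy",
    "completeness",
    "tone",
    "citation_quality",
    "policy_compliance",
    "asks_for_missing_info",
)
MIN_SCORE, MAX_SCORE = 1, 5


def validate(scores):
    """Raise ValueError if a criterion is missing or a score is off the scale."""
    for criterion in CRITERIA:
        if criterion not in scores:
            raise ValueError(f"missing score for {criterion}")
        if not MIN_SCORE <= scores[criterion] <= MAX_SCORE:
            raise ValueError(f"score out of range for {criterion}")


def disagreements(reviewer_a, reviewer_b, threshold=2):
    """Return criteria where the two reviewers differ by at least `threshold`."""
    validate(reviewer_a)
    validate(reviewer_b)
    return {
        c: (reviewer_a[c], reviewer_b[c])
        for c in CRITERIA
        if abs(reviewer_a[c] - reviewer_b[c]) >= threshold
    }


# Made-up scores for a single answer.
a = dict.fromkeys(CRITERIA, 4)
b = dict.fromkeys(CRITERIA, 4)
b["citation_quality"] = 2

print(disagreements(a, b))
```

The output is a list of things to talk about, not a verdict. The point is that the disagreement now has a name.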

Automated checks are useful, but they should be humble. Exact match can work for structured extraction, but it fails for open-ended writing. LLM-as-judge can help compare answers at scale, but it can inherit bias, miss domain nuance, or overvalue fluent explanations. Unit tests can catch formatting rules and required fields. Retrieval metrics can check whether the right documents were found. Human review still matters, especially for high-risk flows where being plausibly wrong is worse than being obviously incomplete.
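For structured extraction, exact match can be applied per field rather than per answer. This is a minimal sketch, assuming the model output and the reference are both flat dictionaries of strings; the field names and the normalization rules (trim, lowercase, collapse whitespace) are illustrative choices, and a real pipeline may need stricter or looser ones.

```python
def normalize(value):
    """Trim, lowercase, and collapse internal whitespace before comparing."""
    return " ".join(str(value).strip().lower().split())


def field_accuracy(predicted, expected):
    """Fraction of expected fields that the prediction matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    hits = sum(
        1
        for key, value in expected.items()
        if key in predicted and normalize(predicted[key]) == normalize(value)
    )
    return hits / len(expected)


# Made-up ticket fields; note the near-identical product names.
expected = {
    "order_id": "A-1042",
    "product": "Widget Pro",
    "refund_requested": "yes",
}
predicted = {
    "order_id": "a-1042 ",
    "product": "Widget Plus",
    "refund_requested": "Yes",
}

print(round(field_accuracy(predicted, expected), 2))
```

The same approach would be the wrong tool for a drafted reply, where two correct answers can share almost no words.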

Regression testing is where evaluation becomes a habit. Every prompt change, model upgrade, retrieval tweak, or system instruction adjustment can fix one case and break another. Without a stable eval set, the team is steering by memory. With one, the team can ask a calmer question: did this change improve the cases we care about without damaging previous behavior? This is less exciting than a demo, but it is much closer to engineering.
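The regression question can be answered with a small diff between two runs over the same eval set. The sketch below assumes each run has already been reduced to a pass or fail per example ID; how a pass is decided (rubric, exact match, human review) is outside this example, and the IDs are invented.

```python
def compare_runs(baseline, candidate):
    """Compare pass/fail results keyed by example ID.

    Only examples present in both runs are compared, so a changed eval set
    does not silently look like an improvement or a regression.
    """
    shared = baseline.keys() & candidate.keys()
    fixed = sorted(i for i in shared if not baseline[i] and candidate[i])
    broken = sorted(i for i in shared if baseline[i] and not candidate[i])
    return {"fixed": fixed, "broken": broken}


# Made-up results before and after a prompt change.
baseline = {"t-001": True, "t-002": False, "t-003": True}
candidate = {"t-001": True, "t-002": True, "t-003": False}

report = compare_runs(baseline, candidate)
print(report)
```

A change that fixes one case and breaks another shows up as exactly that, instead of as an unchanged pass rate.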

Performance is not only answer quality. Latency changes how users feel. Cost changes whether a feature can scale. Token usage changes whether long context remains affordable. Refusal behavior changes user trust. Safety behavior changes legal and operational risk. A model that is slightly better on quality but twice as slow may not be better for the actual product. The evaluation dashboard should reflect the trade-offs the business and users will live with.
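To put quality next to latency and cost, each eval run can be summarized into a few numbers. This is a rough sketch: the per-million-token prices are placeholders rather than any provider's real pricing, the record fields are assumptions, and the percentile uses the simple nearest-rank method.

```python
import math

# Placeholder prices in dollars per million tokens; not real provider pricing.
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 4.00


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def summarize(records):
    """Summarize pass rate, p95 latency, and average cost per request."""
    total_cost = sum(
        r["input_tokens"] * INPUT_PRICE_PER_M
        + r["output_tokens"] * OUTPUT_PRICE_PER_M
        for r in records
    ) / 1_000_000
    return {
        "pass_rate": sum(r["passed"] for r in records) / len(records),
        "p95_latency_ms": percentile([r["latency_ms"] for r in records], 95),
        "cost_per_request": total_cost / len(records),
    }


# Made-up records for one run.
records = [
    {"passed": True, "latency_ms": 120, "input_tokens": 1000, "output_tokens": 200},
    {"passed": True, "latency_ms": 150, "input_tokens": 1000, "output_tokens": 200},
    {"passed": False, "latency_ms": 180, "input_tokens": 1000, "output_tokens": 200},
    {"passed": True, "latency_ms": 900, "input_tokens": 1000, "output_tokens": 200},
]

summary = summarize(records)
print(summary)
```

Looking at p95 rather than the average keeps the one slow request visible, which is often the one users remember.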

Production feedback completes the loop. Users abandon flows, edit generated drafts, give answers a thumbs-down, contact support, retry prompts, or quietly stop using the feature. Logs, PostHog events, manual review queues, and incident notes can all show where the offline eval missed reality. The goal is not to watch users suspiciously. It is to learn where the system helped, where it overreached, and where the next eval examples should come from.
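Turning those signals into new eval examples can start with a simple filter over exported events. The sketch below assumes events have already been exported as plain dictionaries. The event names and fields are hypothetical and do not correspond to any particular analytics tool's schema.

```python
from collections import Counter

# Hypothetical event names for the negative signals described above.
NEGATIVE_SIGNALS = {
    "thumbs_down",
    "draft_heavily_edited",
    "prompt_retried",
    "flow_abandoned",
}


def review_candidates(events, min_signals=2):
    """Return conversation IDs with at least `min_signals` negative events."""
    counts = Counter(
        e["conversation_id"] for e in events if e["event"] in NEGATIVE_SIGNALS
    )
    return sorted(cid for cid, n in counts.items() if n >= min_signals)


# Made-up exported events.
events = [
    {"conversation_id": "c-1", "event": "prompt_retried"},
    {"conversation_id": "c-1", "event": "thumbs_down"},
    {"conversation_id": "c-2", "event": "thumbs_up"},
    {"conversation_id": "c-3", "event": "flow_abandoned"},
]

print(review_candidates(events))
```

The conversations this returns still need a human to read them. The filter only decides where to look first.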

I trust LLM features more when the team can show their evaluation scars. Not only the best examples, but the cases they failed, the rules they added, the model changes they rejected, and the trade-offs they accepted. If you are building with AI, the most useful question may not be “which model is smartest?” It may be “what would convince us that this behavior is reliable enough for this user, in this workflow, today?”
