Picking the right LLM for your workflow
The model is usually not the interesting part of the decision. When we start working with a client on an agent implementation, model selection comes somewhere in the middle of the scoping process, after we understand the task, the data, the edge cases, and the acceptable error rate. It rarely comes first.
That said, it matters. Choosing the wrong model — not in the "this model is generally worse" sense, but in the "this model is wrong for this specific task at this scale" sense — creates real problems in production.
The three things that actually determine model choice
Cost per call times monthly volume gives you the monthly running cost. This sets a practical ceiling. A model that costs twenty times more per call is fine if you're making a hundred calls a day. At fifty thousand calls a day, that math changes quickly. For high-volume classification or extraction tasks, smaller models often win on cost even if their accuracy is slightly lower, because slightly lower accuracy at scale can still be acceptable if the error handling is good.
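To make the arithmetic concrete, here is a minimal sketch of the cost calculation. The per-call prices are invented for illustration (only the twenty-times ratio comes from the text above), and a 30-day month is assumed.

```python
def monthly_cost(cost_per_call: float, calls_per_day: int, days: int = 30) -> float:
    """Monthly running cost = cost per call x daily volume x days in the month."""
    return cost_per_call * calls_per_day * days


# Hypothetical prices in dollars per call; the large model costs twenty times more.
SMALL_MODEL_COST = 0.001
LARGE_MODEL_COST = 0.020

for calls_per_day in (100, 50_000):
    small = monthly_cost(SMALL_MODEL_COST, calls_per_day)
    large = monthly_cost(LARGE_MODEL_COST, calls_per_day)
    print(f"{calls_per_day:>6} calls/day: small ${small:,.2f}/month, large ${large:,.2f}/month")
```

Under these assumed prices, the gap at a hundred calls a day is a few tens of dollars a month. At fifty thousand calls a day the same ratio becomes a difference of tens of thousands of dollars a month, which is the point where the cheaper model deserves a serious evaluation.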
Latency matters when the agent is in a synchronous loop — a user is waiting, or the next step depends on the result. For background jobs — overnight batch processing, scheduled reports, async triage — latency usually doesn't matter much at all. Models that are slower but better at reasoning are a reasonable choice for the cases where you have time.
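One way to encode this rule is to pick the strongest model that fits a latency budget, and to treat background jobs as having no budget at all. The sketch below is illustrative only: the model names, latencies, and quality ranking are placeholders, and real numbers would come from measuring your own traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    typical_latency_s: float  # assumed figure; measure on your own workload
    quality_rank: int         # higher = better at reasoning (assumed ordering)


# Hypothetical candidates, not real products.
OPTIONS = [
    ModelOption("small-fast", 0.5, 1),
    ModelOption("mid", 2.0, 2),
    ModelOption("large-slow", 8.0, 3),
]


def pick_model(latency_budget_s: Optional[float]) -> ModelOption:
    """Return the strongest model within the budget; None means a background job."""
    candidates = [
        m for m in OPTIONS
        if latency_budget_s is None or m.typical_latency_s <= latency_budget_s
    ]
    if not candidates:
        raise ValueError("no model fits the latency budget")
    return max(candidates, key=lambda m: m.quality_rank)


print(pick_model(1.0).name)   # synchronous step with a user waiting
print(pick_model(None).name)  # overnight batch job
```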
Context window size is relevant for tasks involving long documents: contracts, transcripts, support histories, multi-page forms. Some tasks genuinely need a large context window. Others only seem to, because the prompt wasn't designed to be efficient.
Where the "bigger is better" assumption breaks down
Larger models are usually better at open-ended reasoning. They're not always better at structured extraction or classification, especially when the task is well-defined and the inputs are consistent.
For a task like "read this email and classify it as one of six request types, then extract the client name and deadline if present," a well-prompted smaller model will often outperform a larger one — not because it's smarter, but because the task doesn't require the kind of general reasoning larger models excel at. It requires consistent, fast, cheap pattern matching. Smaller models can be tuned or prompted to be very reliable at this.
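A task this well-defined can be pinned down in code on both sides of the model call: a prompt that lists the allowed labels, and a parser that rejects anything outside them. The following is a rough sketch, not a reference implementation. The six label names and the JSON reply format are assumptions made for the example, and the model call itself is left out.

```python
import json
from typing import Optional, TypedDict

# Hypothetical label set; the real six request types come from the task.
REQUEST_TYPES = ["quote", "invoice", "scheduling", "complaint", "document_request", "other"]


class Extraction(TypedDict):
    request_type: str
    client_name: Optional[str]
    deadline: Optional[str]


def build_prompt(email_body: str) -> str:
    """Build a narrow prompt: fixed labels, fixed output shape."""
    labels = ", ".join(REQUEST_TYPES)
    return (
        f"Classify the email as exactly one of: {labels}.\n"
        "Extract the client name and deadline if present, otherwise use null.\n"
        'Reply with JSON only, using the keys "request_type", "client_name", "deadline".\n\n'
        f"Email:\n{email_body}"
    )


def parse_response(raw: str) -> Extraction:
    """Validate the model's reply; raise ValueError so the caller can retry or escalate."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("request_type") not in REQUEST_TYPES:
        raise ValueError(f"unknown request_type: {data.get('request_type')!r}")
    return Extraction(
        request_type=data["request_type"],
        client_name=data.get("client_name"),
        deadline=data.get("deadline"),
    )
```

The strict parser is where the "good error handling" lives: a reply that fails validation is caught before it reaches downstream systems, which is part of what makes a slightly less accurate model tolerable at volume.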
The failure mode is using a large model for everything because it feels safer, then being surprised when the cost is unsustainable for the volume or the latency is too high for the use case.
Evaluation before deployment
The only way to know if a model is right for a task is to run it against real data from that task. Not a benchmark. Not a few examples in a playground. A sample of the actual inputs the system will encounter, with a defined way to measure whether the output is correct.
For classification tasks: what percentage of cases are classified correctly? What types of errors occur? Are the errors random or systematic? A systematic error on a particular input pattern is usually fixable with better prompt design or few-shot examples.
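These questions map onto a very small evaluation harness. The sketch below assumes you already have a labelled sample of real inputs and the model's predictions for them. It reports accuracy and counts each (expected, predicted) confusion pair, so that systematic errors show up as one pair dominating the list. The labels in the usage example are made up.

```python
from collections import Counter


def evaluate(expected: list[str], predicted: list[str]) -> tuple[float, Counter]:
    """Return accuracy and a count of each (expected, predicted) mismatch."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must be the same length")
    if not expected:
        raise ValueError("cannot evaluate an empty sample")
    pairs = list(zip(expected, predicted))
    correct = sum(e == p for e, p in pairs)
    confusions = Counter((e, p) for e, p in pairs if e != p)
    return correct / len(pairs), confusions


# Made-up labels purely to show the output shape.
expected = ["invoice", "invoice", "quote", "quote", "complaint"]
predicted = ["invoice", "quote", "quote", "quote", "complaint"]

accuracy, confusions = evaluate(expected, predicted)
print(f"accuracy: {accuracy:.0%}")
for (want, got), count in confusions.most_common():
    print(f"expected {want!r} but got {got!r}: {count}")
```

If errors are spread thinly across many pairs, they are closer to random. If one pair accounts for most of them, that is the input pattern to target with prompt changes or few-shot examples.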
For generation tasks: defining "correct" is harder. You need a rubric, a human reviewer, or both, for at least a sample of the output. This takes time. It's unavoidable if you want to deploy something that works reliably.
The vendor lock-in question
Vendor lock-in is real but manageable. The practical mitigation is to abstract the model call behind an interface in your code so that switching providers means changing one file rather than refactoring the whole system. We build this as a default. It adds a small amount of upfront complexity and removes a large amount of future risk.
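As a rough illustration of that abstraction, the sketch below defines one small interface and two interchangeable implementations. All names here (ModelClient, EchoClient, ShoutClient, summarize) are hypothetical. The two clients are stand-ins that never touch a network; in a real system each would wrap one vendor's SDK.

```python
from typing import Protocol


class ModelClient(Protocol):
    """The only surface the rest of the codebase is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Stand-in for provider A; a real adapter would call that vendor's SDK here."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ShoutClient:
    """Stand-in for provider B, to show the caller does not change."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize(client: ModelClient, text: str) -> str:
    """Business logic depends on the interface, never on a specific vendor."""
    return client.complete(f"Summarize in one sentence:\n{text}")


# Switching providers is a one-line change where the client is constructed.
print(summarize(EchoClient(), "Quarterly numbers are up."))
print(summarize(ShoutClient(), "Quarterly numbers are up."))
```

The interface is deliberately narrow. Anything provider-specific, such as authentication, retries, or response parsing, stays inside the adapter, so that is the one file that changes when the provider does.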
The bigger lock-in risk, which people underestimate, is prompt dependency. Prompts that work well on one model often don't transfer directly to another. If you're likely to switch providers, whether because pricing changes or because a better option appears, invest time in understanding why the prompt works rather than just that it works. Knowing the reasoning behind a prompt is what makes it portable.