How do they differ?
Model Router and Cascading both exist to solve the same economic problem: not every query needs your most expensive model. Simple questions can be handled by a cheaper, faster model. Complex questions need the full power of a frontier model. The difference is in how they decide which model to use.
A Model Router classifies the query upfront and sends it directly to the appropriate model. One classification, one model call. The query never touches the other models. Think of it like a hospital triage nurse who evaluates you at the door and sends you to the right department.
Cascading tries the cheapest model first and escalates only if the result is not good enough. The query starts at the bottom of the model hierarchy and works its way up until the output meets a quality threshold. Think of it like calling tech support, starting with tier 1, and getting transferred to tier 2 only when tier 1 cannot resolve the issue.
Both strategies can cut your LLM costs by 40% to 70% compared to routing everything through a frontier model. But they make different tradeoffs on latency, implementation complexity, and failure modes.
| Dimension | Model Router | Cascading |
|---|---|---|
| Decision point | Before generation (classify input) | After generation (evaluate output) |
| Model calls per query | Exactly one (plus classifier) | One to N (depends on escalation) |
| Latency (best case) | Low. Single model call. | Low. Cheap model answers immediately. |
| Latency (worst case) | Predictable. Still a single model call, though to a slower premium model. | High. Every model in the chain runs. |
| Cost (best case) | Low. Cheap model for simple queries. | Very low. Cheapest model handles it. |
| Cost (worst case) | Moderate. Expensive model for complex queries. | High. Every cheaper model runs and is discarded before the expensive one answers. |
| Classification accuracy dependency | Critical. Wrong routing = wrong model. | Moderate. Bad output triggers escalation. |
| Implementation complexity | Moderate. Need a good classifier. | Moderate. Need a quality evaluator. |
How Model Router works
The router sits in front of your model pool and acts as a dispatcher. When a query arrives, the router analyzes it and assigns it to a model tier. The analysis can be done in several ways:
Rule-based routing. Simple heuristics like query length, presence of specific keywords, or whether the query contains code. Fast and cheap but limited in accuracy.
Classifier model. A lightweight model (or even an embedding model with a classification head) that predicts query complexity. More accurate than rules but adds a small inference cost and latency.
LLM-as-judge routing. Use a cheap LLM to evaluate the query and decide which model should handle it. More flexible than a fixed classifier but adds the cost of an LLM call.
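As a minimal illustration of the rule-based approach, here is a sketch of a heuristic router. The tier names, keyword list, and length threshold are hypothetical placeholders, not tuned values from any production system:

```python
# Minimal rule-based router sketch. Tier names, keywords, and the
# length threshold are illustrative assumptions, not tuned values.

COMPLEX_KEYWORDS = {"analyze", "compare", "refactor", "prove", "debug"}

def route(query: str) -> str:
    """Assign a query to a model tier using cheap heuristics."""
    words = query.lower().split()
    has_code = "```" in query or "def " in query
    if has_code or len(words) > 60:
        return "tier3"          # long or code-bearing queries: premium model
    if any(w.strip("?,.") in COMPLEX_KEYWORDS for w in words):
        return "tier2"          # complexity keywords: balanced model
    return "tier1"              # everything else: cheap model

print(route("What are your hours?"))                     # -> tier1
print(route("Compare these two pricing plans for me."))  # -> tier2
```

A classifier model or LLM-as-judge would replace the body of `route` while keeping the same interface, which makes the routing method easy to swap later.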
The key architectural decision is defining your model tiers and the routing criteria for each. A common setup:
- Tier 1 (cheap/fast): Haiku, GPT-4o-mini, or a fine-tuned small model. Handles simple lookups, short answers, basic classification.
- Tier 2 (balanced): Sonnet, GPT-4o. Handles most general queries, summarization, moderate reasoning.
- Tier 3 (premium): Opus, o1, o3. Handles complex reasoning, code generation, nuanced analysis.
The router maps incoming queries to tiers based on predicted complexity. The mapping can be trained on historical data: take your query logs, have a human label the minimum model tier that produces an acceptable answer, and train a classifier on those labels.
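To make the training idea concrete, here is a toy sketch of fitting a router from labeled logs. A real system would train a proper text classifier; this version only fits a single word-count threshold, and the example queries and labels are invented for illustration:

```python
# Toy sketch of training a router from labeled logs. Real systems would
# train a text classifier; here we just fit the word-count cutoff that
# best separates "cheap tier is enough" from "needs premium" labels.

labeled_logs = [  # (query, minimum acceptable tier) - hypothetical labels
    ("What are your hours?", "tier1"),
    ("Reset my password", "tier1"),
    ("Summarize this contract and flag unusual clauses", "tier3"),
    ("Explain the tax implications of selling vested RSUs early", "tier3"),
]

def fit_length_threshold(logs):
    """Pick the word-count cutoff that misroutes the fewest examples."""
    lengths = sorted({len(q.split()) for q, _ in logs})
    best_cut, best_errors = lengths[0], len(logs)
    for cut in lengths:
        errors = sum(
            (len(q.split()) > cut) != (tier == "tier3")
            for q, tier in logs
        )
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

threshold = fit_length_threshold(labeled_logs)
print(threshold)
```

The same loop structure (fit a decision rule, score it against the human labels) carries over when the rule is a learned classifier instead of a threshold.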
How Cascading works
Cascading takes a fundamentally different approach. Instead of predicting which model to use, it tests models in order of cost.
The cascade starts with the cheapest model. That model generates a response. A quality evaluator then checks whether the response meets the required standard. If it does, the response is returned. If it does not, the query is sent to the next model in the hierarchy, and the process repeats.
The quality evaluator is the critical component. It can be:
Confidence-based. Check the model's own confidence signals (logprobs, perplexity). Low confidence triggers escalation.
Rule-based. Check for specific quality indicators: response length, presence of hedging language ("I am not sure"), refusal patterns, or format compliance.
LLM-as-judge. Use a separate model to evaluate whether the response is adequate. More accurate but adds latency and cost.
Task-specific. For structured outputs, validate against a schema. For code, run tests. For math, verify the calculation. These are the most reliable evaluators because they check correctness directly.
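A task-specific evaluator for structured output can be as simple as parsing the response and checking required fields. The field names below are illustrative assumptions:

```python
# Task-specific evaluator sketch: for structured output, check correctness
# directly by parsing and validating required fields. Field names are
# illustrative assumptions, not a real schema.

import json

REQUIRED_FIELDS = {"order_id": str, "refund_amount": float}

def passes_schema(response: str) -> bool:
    """Accept the response only if it is JSON with the required fields."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return all(
        isinstance(data.get(field), expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

print(passes_schema('{"order_id": "A-1", "refund_amount": 12.5}'))  # -> True
print(passes_schema("Sure! The refund is twelve fifty."))           # -> False
```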
The cascade typically has two or three tiers. More than three adds latency without much benefit: a query that fails two models almost certainly needs the best one.
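The escalation loop itself is small. Here is a sketch with stand-in model callables and a rule-based evaluator (the hedging-language check mentioned above); real implementations would replace the lambdas with API calls:

```python
# Cascade sketch: try models cheapest-first, escalate on evaluator failure.
# The model callables and evaluator are stand-ins for real API calls.

from typing import Callable

def cascade(query: str,
            models: list[Callable[[str], str]],
            is_good: Callable[[str], bool]) -> str:
    """Return the first response that passes the quality check.

    If every tier fails, fall back to the last (best) model's answer
    rather than returning nothing.
    """
    response = ""
    for model in models:
        response = model(query)
        if is_good(response):
            return response          # passed the quality gate: stop here
    return response                  # all tiers failed: best-effort answer

# Stand-in models and a rule-based evaluator (hedging-language check).
cheap  = lambda q: "I am not sure about that."
strong = lambda q: "Our office is open 9am to 5pm, Monday to Friday."
is_good = lambda r: "not sure" not in r.lower() and len(r.split()) >= 5

print(cascade("What are your hours?", [cheap, strong], is_good))
```

Because the evaluator is passed in as a function, you can swap a rule-based check for schema validation or an LLM-as-judge without touching the cascade logic.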
When to use Model Router
The router pattern works best when you can predict query complexity from the input alone, before any model processes it.
Queries with clear complexity signals. If simple queries tend to be short and complex queries tend to be long, or if certain keywords reliably indicate complexity, a router can classify accurately. Customer support systems often have this property. "What are your hours?" is obviously simple. "I need to dispute a charge from three months ago that was partially refunded but the refund amount is wrong" is obviously complex.
When latency is critical. The router adds minimal latency (a classification step) and then makes exactly one model call. There is no worst-case escalation path that might add multiple model calls. If you have strict latency SLAs, this predictability is valuable.
When you have historical data. If you have logs of queries and the model tier that produced acceptable answers, you can train a highly accurate classifier. The router gets better over time as you collect more data.
Heterogeneous model capabilities. If your models are specialized (one is fine-tuned for code, another for conversation, a third for analysis), a router that matches queries to specializations produces better results than cascading through generalist models.
High-volume systems. When you process millions of queries per day, even small per-query savings multiply. The router's fixed overhead (one classification) scales better than cascading's variable overhead.
When to use Cascading
Cascading works best when you cannot judge difficulty from the input, only from the output.
Unpredictable query complexity. If a short, simple-looking query might require deep reasoning (and you cannot tell until you try), cascading handles this naturally. The cheap model tries, fails, and the expensive model takes over.
When output quality is verifiable. Cascading requires a quality evaluator. If you can check whether the output is good (schema validation, test execution, reference comparison), cascading works well. If quality is subjective and hard to evaluate automatically, cascading is harder to implement reliably.
When cost optimization is the primary goal. Cascading can achieve better cost savings than routing because it always starts with the cheapest model. If 70% of queries are genuinely simple, the cheap model handles them and the expensive model is never called. A router might misclassify some of those queries and send them to a more expensive tier unnecessarily.
Gradual rollout of cheaper models. When you are evaluating whether a new, cheaper model can handle your workload, cascading lets you try it first for every query and fall back to the proven model when it cannot. This is a low-risk way to test model migrations.
When you do not have training data for a classifier. Building a good router requires labeled examples of query complexity. If you are building a new system without historical data, cascading works out of the box because it does not need a pre-trained classifier.
Can they work together?
Yes, and the combination covers the weaknesses of each pattern.
Router with cascading fallback. The router classifies the query and sends it to a model tier. If the response from that tier fails a quality check, it cascades up to the next tier. This handles router misclassification gracefully. Most queries take the fast, single-model path. Misclassified queries get a second chance instead of returning a bad result.
Cascading with a learned router. Start with pure cascading. Log which queries end up at which tier. Use that data to train a router. Over time, the router handles more queries directly, reducing the need for cascading. This is a practical bootstrap strategy.
Router for broad classification, cascading within tiers. The router separates queries into broad categories (code, conversation, analysis). Within each category, a cascade tries the cheaper specialized model first and escalates to the general-purpose frontier model if needed.
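The router-with-cascading-fallback combination can be sketched by starting the cascade at the routed tier instead of the bottom. Tier names and the stand-in models below are illustrative:

```python
# Hybrid sketch: route first, then cascade upward if the quality check
# fails. Tier names and the stand-in models are illustrative.

TIERS = ["tier1", "tier2", "tier3"]

def hybrid(query, route, models, is_good):
    """Start at the routed tier; escalate through remaining tiers on failure."""
    start = TIERS.index(route(query))
    response = ""
    for tier in TIERS[start:]:
        response = models[tier](query)
        if is_good(response):
            return response, tier    # record which tier actually answered
    return response, TIERS[-1]

# Stand-ins: the router misroutes to tier1, which hedges; tier2 recovers.
route   = lambda q: "tier1"
models  = {
    "tier1": lambda q: "I am not sure.",
    "tier2": lambda q: "Your refund was split across two statements.",
    "tier3": lambda q: "Your refund was split across two statements.",
}
is_good = lambda r: "not sure" not in r.lower()

answer, tier = hybrid("Why is my refund amount wrong?", route, models, is_good)
print(tier)
```

Logging the returned tier alongside the routed tier gives you exactly the misclassification data needed for the learned-router bootstrap described above.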
Common mistakes
Building a router without enough training data. A router is only as good as its classifier. If you train on a few hundred examples, the classifier will misroute frequently. You need thousands of labeled examples across all complexity levels. Start with cascading and collect the data you need to build a router later.
Cascading without a quality evaluator. Some implementations cascade based on response time or token count, which are poor proxies for quality. The evaluator needs to check whether the response actually answers the question correctly. Without this, you either escalate too often (wasting the cost savings) or too rarely (returning bad answers).
Too many cascade levels. Three tiers are usually the maximum that makes sense. Each additional tier adds latency for queries that escalate through it. And if a query fails three models, adding a fourth marginal model rarely helps.
Ignoring the cost of the classifier or evaluator. A router that uses GPT-4o as its classifier adds meaningful cost and latency to every query, even the simple ones. Use the cheapest effective method for classification. An embedding model with a classification head is often sufficient and much cheaper than an LLM call.
Not monitoring routing or escalation rates. If 50% of your queries escalate through the full cascade, something is wrong. Either your cheap model is worse than expected, your quality evaluator is too strict, or your query mix has shifted. Monitor these rates and adjust thresholds regularly.
Static thresholds. Both patterns benefit from dynamic adjustment. Query mixes shift over time. Model capabilities change with updates. Review and recalibrate your routing rules and quality thresholds at least monthly.
Cost modeling
Before choosing between these patterns, model the costs for your specific query distribution.
For a Model Router, the cost per query is: classifier cost + model cost (at the routed tier). The total cost is predictable if you know your query complexity distribution.
For Cascading, the expected cost per query is: cheapest model cost + evaluator cost + (escalation probability * (next tier cost + evaluator cost)) + (double-escalation probability * top tier cost). The total cost depends heavily on the escalation rate.
If 80% of your queries are simple, cascading will likely be cheaper because those queries only incur the cheapest model cost. If the query mix is more evenly distributed, routing might be cheaper because it avoids the wasted tokens from models that generate rejected responses during cascading.
Run the numbers with your actual data before committing to an architecture. A spreadsheet with your query volume, complexity distribution, and per-model pricing will tell you more than any rule of thumb.
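That spreadsheet calculation fits in a few lines of code. All prices, shares, and escalation rates below are made-up placeholders; substitute your real per-query costs:

```python
# Back-of-the-envelope cost comparison sketch. All prices and rates are
# made-up placeholders; substitute your real per-query numbers.

def router_cost(classifier, tier_cost, tier_share):
    """Expected per-query cost: classifier + routed-tier cost, weighted by share."""
    return classifier + sum(tier_cost[t] * tier_share[t] for t in tier_cost)

def cascade_cost(costs, evaluator, p_escalate, p_escalate_again):
    """Expected per-query cost for a three-tier cascade."""
    c1, c2, c3 = costs
    return (c1 + evaluator                        # every query pays tier 1
            + p_escalate * (c2 + evaluator)       # some pay tier 2 as well
            + p_escalate_again * c3)              # a few reach tier 3

tier_cost  = {"t1": 0.0005, "t2": 0.005, "t3": 0.03}  # $ per query (assumed)
tier_share = {"t1": 0.7, "t2": 0.2, "t3": 0.1}        # routed distribution

r = router_cost(0.0002, tier_cost, tier_share)
c = cascade_cost([0.0005, 0.005, 0.03], 0.0002, 0.3, 0.1)
print(f"router ${r:.5f} per query, cascade ${c:.5f} per query")
```

With these particular placeholder numbers the router comes out cheaper, which illustrates the point: the answer flips with your escalation rate and price gaps, so plug in your own data.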
References
- Ding, D. et al. (2024). "Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing." arXiv:2404.14618.
- Chen, L. et al. (2023). "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv:2305.05176.
- Anthropic documentation on model selection and routing.
- OpenAI documentation on model capabilities and pricing tiers.