How do they differ?
Both patterns sit at the front door of your AI system and make a routing decision before the main work begins. But they answer fundamentally different questions. Semantic Router looks at a user query and asks, "What kind of task is this?" Model Router looks at the same query and asks, "How hard is this task?"
That distinction matters more than it sounds. Semantic Router is about dispatching to the right handler, tool, or pipeline. Model Router is about dispatching to the right model tier within a given handler. One decides the destination. The other decides the vehicle.
Consider a customer support system. A user writes, "I want to cancel my subscription." Semantic Router recognizes this as a cancellation intent and routes it to the cancellation pipeline, not the billing FAQ pipeline, not the technical support pipeline. Once inside the cancellation pipeline, Model Router evaluates whether this is a straightforward cancel (route to a fast, cheap model) or a complex retention scenario with contract nuances (route to a more capable model).
| Dimension | Semantic Router | Model Router |
|---|---|---|
| Core question | What kind of task is this? | How complex is this task? |
| Classification target | Intent or topic category | Difficulty or complexity tier |
| Output | Pipeline, tool, or handler selection | Model selection (e.g., GPT-4o-mini vs Claude Opus) |
| Typical implementation | Embedding similarity, lightweight classifier, or keyword matching | Rule-based heuristics, token count estimation, or a small classifier |
| Latency budget | Must be fast (sits before all processing) | Must be fast (sits before main LLM call) |
| Failure mode | Wrong pipeline chosen, completely wrong behavior | Wrong model chosen, either wasted cost or degraded quality |
| Training data | Labeled examples of user intents | Labeled examples of query complexity |
The classification axis
Semantic Router typically works by embedding the incoming query and comparing it against a set of reference embeddings for each intent category. If the query is closest to the "code generation" cluster, it goes to the code pipeline. If it is closest to "summarization," it goes there instead. Some implementations use lightweight classifiers or even keyword rules, but embedding similarity is the most common approach because it generalizes well to paraphrases.
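The embedding-similarity approach can be sketched as follows. To keep the example self-contained, a toy bag-of-words vector stands in for a real embedding model (a production system would call something like a sentence-transformer); the vocabulary, intent names, and reference phrases are illustrative assumptions.

```python
import math

# Toy "embedding": bag-of-words counts over a tiny vocabulary. A real
# system would replace embed() with calls to an embedding model.
VOCAB = ["write", "code", "function", "summarize", "document",
         "article", "cancel", "subscription"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# A few reference phrases per intent; the query routes to the intent
# whose closest reference phrase has the highest similarity.
INTENT_EXAMPLES = {
    "code_generation": ["write a function", "write code for"],
    "summarization": ["summarize this document", "summarize the article"],
    "cancellation": ["cancel my subscription", "cancel subscription"],
}

def route_intent(query: str) -> str:
    q = embed(query)
    best_intent, best_score = "general", -1.0
    for intent, examples in INTENT_EXAMPLES.items():
        score = max(cosine(q, embed(e)) for e in examples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent
```

The same structure holds with real embeddings: precompute reference vectors per intent, then take an argmax over similarities at query time.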
Model Router works differently. It evaluates signals like query length, presence of multi-step reasoning indicators, domain-specific terminology density, or even a quick probe with a small model to estimate whether the task needs a heavyweight model. The output is not "which pipeline" but "which tier." Tier 1 might be a small, fast model for simple lookups. Tier 2 might be a mid-range model for standard generation. Tier 3 is the most capable (and expensive) model for complex reasoning.
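A minimal multi-signal tier selector might look like this. The marker words, weights, and thresholds are illustrative assumptions; a production router would tune them against labeled complexity examples.

```python
# Heuristic complexity scorer combining several signals: query length,
# multi-step reasoning indicators, and domain-term density.
REASONING_MARKERS = {"why", "prove", "compare", "design", "optimize", "step"}
DOMAIN_TERMS = {"amortized", "idempotent", "covariance", "quorum"}

def pick_tier(query: str) -> str:
    words = query.lower().split()
    score = 0
    score += len(words) // 50                                       # length signal
    score += sum(w in REASONING_MARKERS for w in words)             # reasoning hints
    score += 2 * sum(w.strip(",.") in DOMAIN_TERMS for w in words)  # jargon density
    if score == 0:
        return "tier-1-fast"       # simple lookups
    if score <= 2:
        return "tier-2-standard"   # routine generation
    return "tier-3-strong"         # complex reasoning
```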
When to use Semantic Router
Semantic Router is the right choice when your system has multiple distinct capabilities and you need to figure out which one the user is asking for.
- Multi-tool agents. If your agent can search the web, query a database, generate code, and summarize documents, Semantic Router determines which tool to invoke before any LLM call happens.
- Domain-specific pipelines. A healthcare platform might have separate pipelines for symptom checking, appointment scheduling, and insurance questions. Each has different context, different prompts, and different safety guardrails.
- Hybrid retrieval systems. When you have multiple knowledge bases (product docs, legal documents, community forums), Semantic Router picks which index to query.
- Cost gating. Some intents do not need an LLM at all. Semantic Router can detect "What are your hours?" and return a static response without touching any model.
The key signal is this: if your system has meaningfully different behaviors depending on what the user wants, you need Semantic Router.
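A dispatch layer built on Semantic Router can fold in the cost-gating case directly, short-circuiting static intents before any model call. The handler and intent names here are hypothetical placeholders.

```python
# Intents with canned answers never reach a model at all (cost gating).
STATIC_ANSWERS = {"store_hours": "We are open 9am-5pm, Monday to Friday."}

def handle(intent: str, query: str) -> str:
    if intent in STATIC_ANSWERS:              # cost gating: no LLM needed
        return STATIC_ANSWERS[intent]
    pipelines = {
        "cancellation": lambda q: f"[cancellation pipeline] {q}",
        "code_generation": lambda q: f"[code pipeline] {q}",
    }
    # Unknown intents fall back to a general-purpose pipeline.
    handler = pipelines.get(intent, lambda q: f"[general pipeline] {q}")
    return handler(query)
```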
When to use Model Router
Model Router is the right choice when you have a single pipeline but want to optimize the cost-quality tradeoff based on how difficult each request is.
- High-volume APIs. If you process thousands of requests per minute, sending everything to your most expensive model is wasteful. Model Router lets you handle 70-80% of simple requests with a cheap model and reserve expensive models for the hard cases.
- Latency-sensitive applications. Smaller models respond faster. For simple queries where quality is not meaningfully different between model tiers, picking the faster model improves user experience.
- Budget constraints. When you have a fixed monthly budget for LLM API calls, Model Router stretches that budget by matching spend to difficulty.
- Quality preservation. The flip side of cost savings. Without Model Router, teams often default to a mid-tier model for everything, which means complex queries get worse results than they should. Model Router lets you use the best model where it actually matters.
The key signal: if you are running one pipeline but want to be smart about which model handles each request, you need Model Router.
Can they work together?
Yes, and in most production systems they should. The natural architecture is a two-stage routing layer:
- Stage 1: Semantic Router classifies the intent and picks the pipeline.
- Stage 2: Model Router within each pipeline classifies complexity and picks the model.
This is not theoretical. Consider a coding assistant that supports multiple languages. Semantic Router first determines whether the user wants code generation, code review, or debugging help. That selects the pipeline (different system prompts, different tool availability, different output formats). Then within the code generation pipeline, Model Router evaluates whether this is a simple utility function (use a fast model) or a complex algorithm with edge cases (use the strongest model).
The two routers use different classifiers trained on different labels. Semantic Router is trained on intent labels. Model Router is trained on complexity labels. They operate independently and compose naturally.
```
User Query
     │
     ▼
┌──────────────┐
│   Semantic   │ → "code generation" intent
│    Router    │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Code Gen   │
│   Pipeline   │
│              │
│ ┌──────────┐ │
│ │  Model   │ │ → "high complexity" → Claude Opus
│ │  Router  │ │
│ └──────────┘ │
└──────────────┘
```
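The two-stage flow composes into a few lines. Both classifiers here are deliberately trivial stand-ins (a keyword check for intent, query length for complexity); in practice each stage would be the embedding-based and signal-based routers described earlier.

```python
# Stage 1: intent classification (stand-in: keyword match).
def semantic_route(query: str) -> str:
    return "code_generation" if "code" in query.lower() else "general"

# Stage 2: complexity classification (stand-in: length threshold).
def model_route(query: str) -> str:
    return "strong-model" if len(query.split()) > 20 else "fast-model"

def route(query: str) -> tuple[str, str]:
    pipeline = semantic_route(query)   # which pipeline?
    model = model_route(query)         # which model tier within it?
    return pipeline, model
```

Because the stages are independent, each router can be retrained or swapped out without touching the other.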
Implementation notes for the combined approach
Keep both routers lightweight. The whole point of routing is to save time and money, so if the routing layer itself is expensive, you have defeated the purpose. An embedding-based Semantic Router adds perhaps 5-20 ms; a rule-based Model Router adds under 1 ms; even a small classifier for Model Router should stay under 10 ms.
Monitor both routers independently. Track Semantic Router accuracy (did it pick the right pipeline?) and Model Router accuracy (did the chosen model produce acceptable quality?). These are different failure modes with different remediation strategies.
Common mistakes
Conflating the two patterns. The most common mistake is building one router that tries to do both jobs. You end up with a classifier that has labels like "simple-code-generation" and "complex-summarization," which mixes two orthogonal dimensions. This creates a combinatorial explosion of categories and makes the classifier harder to train and maintain.
Over-engineering Semantic Router for few intents. If you only have two or three pipelines, a simple keyword match or regex might work better than embedding similarity. Do not add an embedding model and a vector store just to route between "search" and "generate."
Ignoring the fallback path. Both routers need a fallback for when confidence is low. For Semantic Router, the fallback might be a general-purpose pipeline. For Model Router, the fallback should be the most capable model (not the cheapest). When in doubt about complexity, overshoot on quality rather than undershoot.
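The fallback logic is small enough to show directly. The confidence threshold and tier names are assumptions; the point is the asymmetry, where a low-confidence intent falls back to a general pipeline while a low-confidence complexity estimate falls back to the strongest model.

```python
# Low-confidence fallbacks: safe destination for intent, overshoot on
# quality (not cost) for model tier. Threshold 0.6 is an assumption.
def route_with_fallback(intent: str, intent_conf: float,
                        tier: str, tier_conf: float) -> tuple[str, str]:
    if intent_conf < 0.6:
        intent = "general"           # general-purpose pipeline
    if tier_conf < 0.6:
        tier = "tier-3-strong"       # most capable model, never the cheapest
    return intent, tier
```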
Static complexity rules that do not adapt. A common Model Router implementation uses token count as a proxy for complexity. But a 10-token query can be incredibly hard ("Prove P != NP informally") and a 500-token query can be trivial (a long but simple data transformation). Use multiple signals, not just length.
Not measuring cost savings. Teams implement Model Router but never measure whether it actually reduced costs. Track the percentage of requests going to each tier and the quality scores for each tier. If your cheap model handles 80% of traffic with no quality drop, that is a clear win. If quality drops significantly, your complexity classifier needs retraining.
Routing latency that exceeds the savings. If your Model Router takes 200ms to decide and the difference between model tiers is 300ms, you have only saved 100ms net. Measure end-to-end, not just the model call.
References
- Semantic Router pattern on genaipatterns.dev
- Model Router pattern on genaipatterns.dev
- Aurelio AI's semantic-router library for embedding-based intent classification
- Martian's Model Router for automatic LLM selection based on query complexity
- OpenRouter's model routing approach for cost-optimized inference
- Anthropic's documentation on model selection and tiered usage strategies