Model Selection Framework
4 models. 4 evaluation dimensions. 40,000 planned API calls. Moving past vibes to measured production selection.
The Gap
Benchmark suites miss what dominates operational costs: latency variance at scale, retry rates that multiply nominal costs 1.5-3x, context degradation beyond advertised specs, and output consistency that determines pipeline trust.
Models Under Evaluation
| Model | Provider | Context | Input/1M | Output/1M |
|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 200K | $15.00 | $75.00 |
| Claude Sonnet 4.6 | Anthropic | 200K | $3.00 | $15.00 |
| Kimi 2.5 | Fireworks | 256K | $0.80 | $0.80 |
| MiniMax 2.5 | Fireworks | 1M | $0.26 | $0.26 |
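To make the price columns concrete, here is a minimal sketch that converts them into a nominal per-request cost. The prices are copied from the table; the request shape (4,000 input tokens, 500 output tokens) and the function name `request_cost` are illustrative assumptions, and the result excludes retries.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Claude Opus 4.6": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Kimi 2.5": (0.80, 0.80),
    "MiniMax 2.5": (0.26, 0.26),
}


def request_cost(model, input_tokens, output_tokens):
    """Nominal USD cost of one call, before any retries."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical request shape: 4,000 input tokens and 500 output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 4_000, 500):.5f}")
```

Because output tokens cost five times as much as input tokens for the two Claude models, the gap between models depends on the input/output mix of the workload, not only on the headline price.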
Four Evaluation Dimensions
Quality distribution. Mean, std dev, percentile distributions. Not point estimates: variance is the signal. Human-rated on a 1-5 rubric with inter-rater reliability.
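A minimal sketch of how rubric scores might be summarized as a distribution, using only the Python standard library. The choice of percentiles (p5, p50, p95) and the use of unweighted Cohen's kappa for two raters are assumptions; the text does not say which reliability statistic is planned.

```python
import statistics
from collections import Counter


def summarize_scores(scores):
    """Summarize 1-5 rubric scores as a distribution, not a single number."""
    ordered = sorted(scores)
    # quantiles(n=20) returns 19 cut points: index 0 is p5, 9 is p50, 18 is p95.
    cuts = statistics.quantiles(ordered, n=20, method="inclusive")
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "p5": cuts[0],
        "p50": cuts[9],
        "p95": cuts[18],
    }


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same outputs.

    Undefined (division by zero) if both raters use one identical label throughout.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(summarize_scores([3, 4, 4, 5, 2, 4, 3, 5, 4, 1]))
    print(cohen_kappa([3, 4, 4, 5, 2], [3, 4, 5, 5, 2]))
```

Since rubric scores are ordinal, a weighted kappa that penalizes a 1-versus-5 disagreement more than a 4-versus-5 one may be a better fit; the unweighted form is shown only because it is the simplest.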
Latency under load. 1, 10, 50, 100+ concurrent requests. Context sizes: 1K, 4K, 16K, 32K, 64K. Multiple geographic origins. Rate limit behavior included.
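The concurrency and context-size grid above could be driven by a small `asyncio` harness like the one below. This is a sketch under stated assumptions: `stub_call` is a hypothetical stand-in for a real provider client, "1K" is read as 1,000 tokens, 100 stands in for "100+", and geographic origin and rate-limit handling are left out.

```python
import asyncio
import time

CONCURRENCY_LEVELS = [1, 10, 50, 100]
CONTEXT_SIZES = [1_000, 4_000, 16_000, 32_000, 64_000]  # assumed to be tokens


async def stub_call(context_tokens):
    """Hypothetical placeholder; replace with a real API call."""
    await asyncio.sleep(0.001)


async def timed_call(call, context_tokens, limiter):
    async with limiter:  # cap in-flight requests at the chosen concurrency
        start = time.perf_counter()
        await call(context_tokens)
        return time.perf_counter() - start  # excludes client-side queueing


async def run_cell(call, concurrency, context_tokens, total_requests):
    """Latencies in seconds for one (concurrency, context size) cell."""
    limiter = asyncio.Semaphore(concurrency)
    tasks = [timed_call(call, context_tokens, limiter) for _ in range(total_requests)]
    return await asyncio.gather(*tasks)


async def run_matrix(call, requests_per_cell=20):
    results = {}
    for concurrency in CONCURRENCY_LEVELS:
        for context_tokens in CONTEXT_SIZES:
            results[(concurrency, context_tokens)] = await run_cell(
                call, concurrency, context_tokens, requests_per_cell
            )
    return results


if __name__ == "__main__":
    matrix = asyncio.run(run_matrix(stub_call))
    for cell, latencies in matrix.items():
        print(cell, f"max={max(latencies) * 1000:.1f} ms")
```

Keeping the raw per-request latencies, rather than an average per cell, is what allows percentile and variance reporting afterwards.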
Retry analysis. Success on first try, success after retry, failed after max retries, timeout. Failure pattern analysis by task type and context size.
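The four outcome classes above could be tallied as follows. This is a minimal sketch: the record keys (`attempts`, `succeeded`, `timed_out`) are a hypothetical log format, and `attempts_per_success` approximates the cost multiplier only if every attempt costs roughly the same.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    FIRST_TRY = "success on first try"
    AFTER_RETRY = "success after retry"
    EXHAUSTED = "failed after max retries"
    TIMEOUT = "timeout"


def classify(attempts, succeeded, timed_out):
    """Map one request's log fields to an outcome class."""
    if timed_out:
        return Outcome.TIMEOUT
    if not succeeded:
        return Outcome.EXHAUSTED
    return Outcome.FIRST_TRY if attempts == 1 else Outcome.AFTER_RETRY


def summarize(records):
    """records: dicts with hypothetical keys attempts, succeeded, timed_out."""
    records = list(records)
    counts = Counter(
        classify(r["attempts"], r["succeeded"], r["timed_out"]) for r in records
    )
    total_attempts = sum(r["attempts"] for r in records)
    successes = counts[Outcome.FIRST_TRY] + counts[Outcome.AFTER_RETRY]
    return {
        "rates": {o.value: counts[o] / len(records) for o in Outcome},
        # Attempts paid for per usable result, including failed requests.
        "attempts_per_success": total_attempts / successes if successes else float("inf"),
    }
```

Running `summarize` separately on records grouped by task type or context size would give the failure-pattern breakdown the text describes.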
Context window. Retrieval accuracy at varying positions. Latency degradation as context grows. Do advertised specs match real usage?
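Retrieval accuracy by position is often measured with a "needle in a haystack" style test; the sketch below shows one way that could look. The filler text, the access-code needle, and the `ask_model` callable are all hypothetical, and `forgetful_stub` is a toy model that reads only the first half of its prompt so the example runs without an API.

```python
import re


def build_haystack(filler_sentence, needle, total_sentences, position):
    """Insert `needle` at a relative position (0.0 = start, 1.0 = end)."""
    index = round(position * total_sentences)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(index, needle)
    return " ".join(sentences)


def accuracy_by_position(ask_model, positions, total_sentences, trials=5):
    """ask_model(prompt) -> str is a hypothetical wrapper around the model under test."""
    results = {}
    for position in positions:
        hits = 0
        for trial in range(trials):
            secret = str(7000 + trial)
            needle = f"The access code is {secret}."
            prompt = build_haystack(
                "The sky was grey that day.", needle, total_sentences, position
            )
            answer = ask_model(prompt + "\n\nWhat is the access code?")
            hits += secret in answer  # True counts as 1
        results[position] = hits / trials
    return results


def forgetful_stub(prompt):
    """Toy model: only 'sees' the first half of the prompt."""
    visible = prompt[: len(prompt) // 2]
    match = re.search(r"access code is (\d+)", visible)
    return match.group(1) if match else "unknown"


if __name__ == "__main__":
    print(accuracy_by_position(forgetful_stub, [0.0, 0.25, 0.75, 1.0], total_sentences=200))
```

Repeating the sweep at several `total_sentences` values, up to the advertised context limit, is what would show whether accuracy holds at the stated window size.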
Current Allocation
| Model | Role | Rationale |
|---|---|---|
| Kimi 2.5 | Default prototyping | Lowest cost, acceptable quality, fast TTFT |
| Sonnet 4.6 | Customer-facing | Safety training, edge-case risk reduction |
| Opus 4.6 | Complex reasoning | Reserved for tasks where failure cost exceeds premium |
| MiniMax 2.5 | Batch processing | Price enables volume experimentation |
These allocations are provisional. Sufficient data does not yet exist to confirm whether Kimi's consistency advantage outweighs Sonnet's safety benefits, or whether MiniMax's retry rate makes it truly cheaper.
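Whether MiniMax stays cheaper once retries are counted comes down to a break-even calculation. The sketch below uses the table prices and assumes every attempt consumes the same number of tokens and that both models see the same workload; the function names are illustrative.

```python
# USD per 1M tokens; input and output prices are equal for both models in the table.
PRICE_PER_M = {"Kimi 2.5": 0.80, "MiniMax 2.5": 0.26}


def effective_price(price_per_m, attempts_per_success):
    """Price per 1M tokens of usable output once retries are paid for."""
    return price_per_m * attempts_per_success


def break_even_attempts(cheap, reference, reference_attempts=1.0):
    """Attempts per success at which `cheap` costs as much as `reference`."""
    return PRICE_PER_M[reference] * reference_attempts / PRICE_PER_M[cheap]


print(round(break_even_attempts("MiniMax 2.5", "Kimi 2.5"), 2))  # 3.08
```

Under these assumptions MiniMax would need roughly three times as many attempts per usable result as Kimi before its price advantage disappears, which is the threshold the retry experiment would have to test.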
Open Questions
Planned Experiments
| Experiment | Sample Size | Timeline |
|---|---|---|
| Quality distribution | 8,000 outputs | 4 weeks |
| Latency under load | 10,000 requests/model | 2 weeks |
| Retry analysis | 5,000 requests/model | 3 weeks |
| Context window | 700 tests | 2 weeks |