LabNotes
2026-03-03 · Lab Notes ◆ Experimental

Model Selection Framework

Four models, four evaluation dimensions, and roughly 40,000 planned API calls per model: moving past vibes to measured production selection.

Models: 4
Dimensions: 4
Planned Calls: ~40K per model
Timeline: 8-11 weeks

The Gap

Public benchmarks: MMLU, HumanEval, Arena Elo
Production reality: latency variance, retries, degradation

Benchmark suites miss what dominates operational costs: latency variance at scale, retry rates that multiply nominal costs 1.5-3x, context degradation beyond advertised specs, and output consistency that determines pipeline trust.


Models Under Evaluation

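| Model | Provider | Context | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 200K | $15.00 | $75.00 |
| Claude Sonnet 4.6 | Anthropic | 200K | $3.00 | $15.00 |
| Kimi 2.5 | Fireworks | 256K | $0.80 | $0.80 |
| MiniMax 2.5 | Fireworks | 1M | $0.26 | $0.26 |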
Price Spread (input): Opus = ~58x MiniMax ($15.00 / $0.26)
Price Spread (output): Opus = ~288x MiniMax ($75.00 / $0.26)
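
The spread figures follow directly from the listed prices. A minimal sketch of that arithmetic, plus a per-request cost estimate, is shown here; the prices are copied from the table, while the function names `price_ratio` and `request_cost` are illustrative and not part of any provider SDK.

```python
# Per-1M-token prices in USD, copied from the table above.
PRICES = {
    "Claude Opus 4.6": {"input": 15.00, "output": 75.00},
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "Kimi 2.5": {"input": 0.80, "output": 0.80},
    "MiniMax 2.5": {"input": 0.26, "output": 0.26},
}


def price_ratio(expensive: str, cheap: str, direction: str) -> float:
    """Return how many times more `expensive` costs than `cheap` per token."""
    return PRICES[expensive][direction] / PRICES[cheap][direction]


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Nominal cost in USD of one request, before any retries."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    for direction in ("input", "output"):
        ratio = price_ratio("Claude Opus 4.6", "MiniMax 2.5", direction)
        print(f"Opus vs MiniMax ({direction}): {ratio:.1f}x")
```

These are nominal prices only; they do not account for retries or for differences in how many tokens each model emits for the same task.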

Four Evaluation Dimensions

1 --test quality distribution — 200 tasks x 10 samples x 4 models = 8,000 outputs

Report mean, standard deviation, and percentile distributions rather than point estimates, because variance is the signal. Outputs are human-rated on a 1-5 rubric, with inter-rater reliability tracked.
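
As a rough illustration of reporting a distribution instead of a single score, the sketch below summarizes one model's rubric ratings with Python's standard `statistics` module. The function name `summarize_scores` and the choice of p10/p50/p90 are assumptions for the example, not the notes' final reporting format.

```python
import statistics


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Summarize 1-5 rubric scores as a distribution, not a point estimate."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    pct = statistics.quantiles(scores, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "p10": pct[9],
        "p50": pct[49],
        "p90": pct[89],
    }


if __name__ == "__main__":
    # Two hypothetical models with the same mean but different spread.
    steady = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
    erratic = [1, 5, 1, 5, 1, 5, 1, 5, 3, 3]
    print(summarize_scores(steady))
    print(summarize_scores(erratic))
```

Both hypothetical score lists average 3.0, yet only the spread and the tail percentiles show that the second model would need far more downstream verification.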

2 --test latency under load — TTFT + TPS at p50/p95/p99 across concurrency levels

1, 10, 50, 100+ concurrent requests. Context sizes: 1K, 4K, 16K, 32K, 64K. Multiple geographic origins. Rate limit behavior included.
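
One way to organize this test is as a grid of concurrency level by context size, with time to first token (TTFT) and tokens per second (TPS) derived from three timestamps per request. The sketch below assumes that 1K means 1,000 tokens, uses 100 as a stand-in for the "100+" level, and introduces a hypothetical `Timing` record; it does not issue any real API calls.

```python
import math
from dataclasses import dataclass
from itertools import product

CONCURRENCY_LEVELS = [1, 10, 50, 100]  # 100 stands in for "100+"
CONTEXT_SIZES = [1_000, 4_000, 16_000, 32_000, 64_000]  # tokens (assumed)


@dataclass
class Timing:
    """Timestamps in seconds for one streamed request (hypothetical record)."""
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int

    @property
    def ttft(self) -> float:
        """Time to first token."""
        return self.first_token_at - self.sent_at

    @property
    def tps(self) -> float:
        """Output tokens per second once streaming has started."""
        return self.output_tokens / (self.finished_at - self.first_token_at)


def test_matrix() -> list[tuple[int, int]]:
    """Every (concurrency, context size) cell to measure."""
    return list(product(CONCURRENCY_LEVELS, CONTEXT_SIZES))


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Each cell in the grid would then be reported as `percentile(ttfts, 50)`, `percentile(ttfts, 95)`, and `percentile(ttfts, 99)`, and likewise for TPS. Nearest-rank is chosen here because it always returns an observed value, which keeps the tail figures easy to trace back to specific requests.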

3 --test retry analysis — 5,000 requests per model, true cost = nominal x (1 + retry rate)

Each request is classified into one of four outcomes: success on first try, success after retry, failed after max retries, or timeout. Failure patterns are then analyzed by task type and context size.
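
The cost formula above can be sketched as follows. This assumes "retry rate" means the average number of extra attempts per request and that every attempt is billed at the same nominal price; neither definition is fixed by the notes, and the `RequestLog` record is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

OUTCOMES = ("first_try", "after_retry", "failed", "timeout")


@dataclass
class RequestLog:
    """One logical request (hypothetical record shape)."""
    outcome: str   # one of OUTCOMES
    attempts: int  # total calls made, including the first


def retry_rate(logs: list[RequestLog]) -> float:
    """Average number of extra attempts per logical request (assumed definition)."""
    extra_attempts = sum(log.attempts - 1 for log in logs)
    return extra_attempts / len(logs)


def true_cost(nominal: float, rate: float) -> float:
    """true cost = nominal x (1 + retry rate)."""
    return nominal * (1 + rate)


def outcome_shares(logs: list[RequestLog]) -> dict[str, float]:
    """Fraction of requests that ended in each outcome category."""
    counts = Counter(log.outcome for log in logs)
    return {outcome: counts[outcome] / len(logs) for outcome in OUTCOMES}
```

Under this definition, a batch in which 6 of 10 requests succeed immediately, 3 need one retry, and 1 fails after 4 attempts has a retry rate of 0.6, so a nominal $100 workload would cost about $160.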

4 --test context window reality — needle-in-haystack + reasoning at 4K/16K/32K/64K

Retrieval accuracy at varying positions. Latency degradation as context grows. Do advertised specs match real usage?
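
A needle-in-haystack prompt can be built by planting one known fact at a chosen relative depth inside filler text. The sketch below uses whitespace-separated words as a crude proxy for tokens, which is an assumption; a real run would size the context with each model's own tokenizer. The helper names `build_haystack` and `retrieved` are illustrative.

```python
def build_haystack(needle: str, filler: str, total_words: int, position: float) -> str:
    """Insert `needle` at relative `position` (0.0 = start, 1.0 = end) of filler text."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")
    # Repeat the filler until it is long enough, then trim to the target length.
    repeats = total_words // len(filler_words) + 1
    words = (filler_words * repeats)[:total_words]
    cut = int(position * total_words)
    return " ".join(words[:cut] + [needle] + words[cut:])


def retrieved(answer: str, expected: str) -> bool:
    """Lenient check: does the model's answer contain the planted fact?"""
    return expected.lower() in answer.lower()


if __name__ == "__main__":
    needle = "The vault code is 7421."
    filler = "The quick brown fox jumps over the lazy dog."
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(needle, filler, total_words=4_000, position=depth)
        print(depth, len(prompt.split()))
```

Sweeping `position` at each context size gives a depth-by-length accuracy grid, which is the usual way to see whether retrieval weakens in the middle of long contexts.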


Current Allocation

| Model | Role | Rationale |
|---|---|---|
| Kimi 2.5 | Default prototyping | Lowest cost, acceptable quality, fast TTFT |
| Sonnet 4.6 | Customer-facing | Safety training, edge-case risk reduction |
| Opus 4.6 | Complex reasoning | Reserved for tasks where failure cost exceeds premium |
| MiniMax 2.5 | Batch processing | Price enables volume experimentation |

These allocations are provisional. Sufficient data does not yet exist to confirm whether Kimi's consistency advantage outweighs Sonnet's safety benefits, or whether MiniMax's retry rate makes it truly cheaper.


Open Questions

? --q1 Does Kimi's lower variance translate to fewer retries and less verification?
? --q2 At what volume does Opus's premium become indefensible?
? --q3 Does MiniMax's 1M context work reliably at full length?
? --q4 How large is the Fireworks latency penalty versus theoretical direct access?
? --q5 How does Anthropic direct API reliability compare with Fireworks-hosted models for our geo?

Planned Experiments

| Experiment | Sample Size | Timeline |
|---|---|---|
| Quality distribution | 8,000 outputs | 4 weeks |
| Latency under load | 10,000 requests/model | 2 weeks |
| Retry analysis | 5,000 requests/model | 3 weeks |
| Context window | 700 tests | 2 weeks |
Total: ~40K calls/model
Human Rating: ~40 hours
Full Timeline: 8-11 weeks