2026-03-03 · Lab Notes ◆ Experimental

Model Selection Framework

4 models. 4 evaluation dimensions. ~40,000 planned API calls per model. Moving past vibes to measured production selection.

Models: 4

Dimensions: 4

Planned Calls: ~40K per model

Timeline: 8-11 weeks

The Gap

Public benchmarks (MMLU, HumanEval, Arena Elo) ≠ production reality (latency variance, retries, degradation)

Benchmark suites miss what dominates operational costs: latency variance at scale, retry rates that multiply nominal costs by 1.5-3x, context degradation that advertised specs do not reveal, and output consistency, which determines how far a pipeline can trust a model.

Models Under Evaluation

Model	Provider	Context (tokens)	Input ($/1M tokens)	Output ($/1M tokens)
Claude Opus 4.6	Anthropic	200K	$15.00	$75.00
Claude Sonnet 4.6	Anthropic	200K	$3.00	$15.00
Kimi 2.5	Fireworks	256K	$0.80	$0.80
MiniMax 2.5	Fireworks	1M	$0.26	$0.26

Price Spread Opus = 58x MiniMax (input)

Price Spread Opus = 288x MiniMax (output)
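The spread figures are simple ratios of the per-1M-token prices in the table. A minimal sketch that recomputes them, with the prices copied from the table and the helper name `price_spread` chosen only for illustration:

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Claude Opus 4.6": {"input": 15.00, "output": 75.00},
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "Kimi 2.5": {"input": 0.80, "output": 0.80},
    "MiniMax 2.5": {"input": 0.26, "output": 0.26},
}


def price_spread(expensive: str, cheap: str, direction: str) -> float:
    """Ratio of list prices between two models for 'input' or 'output' tokens."""
    return PRICES[expensive][direction] / PRICES[cheap][direction]


for direction in ("input", "output"):
    ratio = price_spread("Claude Opus 4.6", "MiniMax 2.5", direction)
    print(f"Opus vs MiniMax ({direction}): {ratio:.0f}x")
```

These are list-price ratios only. They say nothing about retries or output quality, which the dimensions below are meant to measure.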

Four Evaluation Dimensions

1 --test quality distribution — 200 tasks x 10 samples x 4 models = 8,000 outputs

Mean, std dev, percentile distributions. Not point estimates — variance is the signal. Human-rated on 1-5 rubric with inter-rater reliability.
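To illustrate why variance is treated as the signal, here is a small sketch that summarizes rubric scores with the Python standard library. The score lists are made up for the example, and `summarize_scores` is a hypothetical helper, not part of any existing tooling:

```python
import statistics


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Summarize 1-5 rubric scores for one model (or one model/task cell)."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%,
    # so index 1 is the 10th percentile and index 17 is the 90th.
    cuts = statistics.quantiles(scores, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "p10": cuts[1],
        "p50": statistics.median(scores),
        "p90": cuts[17],
    }


# Hypothetical ratings: both lists have the same mean (4.0).
steady = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
erratic = [5, 5, 5, 5, 5, 5, 5, 2, 2, 1]

for name, scores in (("steady", steady), ("erratic", erratic)):
    print(name, summarize_scores(scores))
```

Both hypothetical models report the same mean, but the low percentile and the standard deviation separate them. That is the case a point estimate would hide.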

2 --test latency under load — TTFT + TPS at p50/p95/p99 across concurrency levels

1, 10, 50, 100+ concurrent requests. Context sizes: 1K, 4K, 16K, 32K, 64K. Multiple geographic origins. Rate limit behavior included.
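A rough sketch of how the p50/p95/p99 figures could be computed from raw TTFT samples, using the nearest-rank definition of a percentile. The sample values are invented, and the choice of nearest-rank over interpolation is an assumption, not something the notes specify:

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(ttft_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 for one (model, concurrency, context size) cell."""
    return {f"p{p}": percentile(ttft_ms, p) for p in (50, 95, 99)}


# Hypothetical cell: 95 fast responses and 5 slow ones.
samples = [100.0] * 95 + [2000.0] * 5
print(latency_report(samples))
```

With these invented samples, p50 and p95 both read 100 ms while p99 reads 2000 ms, which is why the tail percentile is tracked separately at each concurrency level.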

3 --test retry analysis — 5,000 requests per model, true cost = nominal x (1 + retry rate)

Success on first try, success after retry, failed after max retries, timeout. Failure pattern analysis by task type and context size.
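The cost formula above can be sketched as follows. This assumes "retry rate" means the average number of extra attempts per request, and that retried calls consume about as many tokens as first attempts. Neither assumption is stated in the notes, and all function names are illustrative:

```python
def retry_rate_from_attempts(total_attempts: int, total_requests: int) -> float:
    """Average extra attempts per request (assumed definition of retry rate)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_attempts - total_requests) / total_requests


def true_cost(nominal: float, retry_rate: float) -> float:
    """true cost = nominal x (1 + retry rate), as in the notes."""
    if retry_rate < 0:
        raise ValueError("retry_rate must be non-negative")
    return nominal * (1 + retry_rate)


def break_even_retry_rate(cheap_nominal: float, other_true_cost: float) -> float:
    """Retry rate at which the cheaper model's true cost equals the other model's."""
    return other_true_cost / cheap_nominal - 1


# Hypothetical run: 5,000 requests that needed 6,500 attempts in total.
rate = retry_rate_from_attempts(6500, 5000)
print(f"retry rate: {rate:.2f}")
print(f"true cost at $0.26 nominal: ${true_cost(0.26, rate):.3f} per 1M tokens")

# List prices from the table; assumes the pricier model needs no retries.
print(f"break-even vs $0.80: {break_even_retry_rate(0.26, 0.80):.2f} retries/request")
```

Under these assumptions, a $0.26 model would need slightly more than two extra attempts per request before it cost as much as a retry-free $0.80 model. Whether real retry rates come anywhere near that is what the 5,000-request runs are meant to establish.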

4 --test context window reality — needle-in-haystack + reasoning at 4K/16K/32K/64K

Retrieval accuracy at varying positions. Latency degradation as context grows. Do advertised specs match real usage?
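One way a needle-in-haystack case could be constructed is sketched below: a known fact is placed at a chosen relative depth inside filler text, and the model's answer is checked for that fact. The filler, the needle, and the substring check are all placeholders, and real tests would size the filler by tokens rather than by sentence count:

```python
def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_sentences))
    parts = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(parts)


def retrieved(answer: str, expected: str) -> bool:
    """Crude scoring: does the model's answer contain the expected string?"""
    return expected.lower() in answer.lower()


# Hypothetical inputs for illustration only.
filler = [f"Filler sentence {i}." for i in range(10)]
needle = "The vault code is 4812."

for depth in (0.0, 0.5, 1.0):
    prompt = build_haystack(filler, needle, depth)
    print(depth, prompt.index(needle))
```

Sweeping `depth` at each context size gives the position-by-length grid over which retrieval accuracy and latency can be recorded.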

Current Allocation

Model	Role	Rationale
Kimi 2.5	Default prototyping	Low cost, acceptable quality, fast TTFT
Sonnet 4.6	Customer-facing	Safety training, edge-case risk reduction
Opus 4.6	Complex reasoning	Reserved for tasks where failure cost exceeds premium
MiniMax 2.5	Batch processing	Price enables volume experimentation

These allocations are provisional. We do not yet have the data to confirm whether Kimi's apparent consistency advantage outweighs Sonnet's safety benefits, or whether MiniMax stays cheaper once its retry rate is counted.

Open Questions

? --q1 Does Kimi's lower variance translate to fewer retries and less verification?

? --q2 At what volume does Opus's premium become indefensible?

? --q3 Does MiniMax's 1M context work reliably at full length?

? --q4 How large is the Fireworks latency penalty compared with theoretical direct access?

? --q5 How does Anthropic's direct API reliability compare with Fireworks-hosted models from our geographic region?

Planned Experiments

Experiment	Sample Size	Timeline
Quality distribution	8,000 outputs	4 weeks
Latency under load	10,000 requests/model	2 weeks
Retry analysis	5,000 requests/model	3 weeks
Context window	700 tests	2 weeks

Total: ~40K calls/model

Human Rating: ~40 hours

Full Timeline: 8-11 weeks
