Picking the open-weight value king: DeepSeek V4, MiniMax M3, Kimi K2.6, and GLM 5.2

In 2026 almost every best-value model is open-weight, and most come from Chinese labs. DeepSeek V4 Flash ($0.09 per 1M input tokens, AA 40), MiniMax M3 ($0.30, AA 44, 1M context), Kimi K2.6 (strong agentic), and GLM 5.2 ($1.20, AA 51, coding 68.8) each occupy a different niche. Using real pricing and coding/agentic scores, we place all four on one price–capability map and say which to pick for which job.

1. Why open-weight models took over the value tier in 2026

Two years ago the implied deal in the LLM market was simple: pay more, get better. The frontier was owned by OpenAI and Anthropic, and every open-weight alternative came with meaningful capability compromises you had to rationalize. That deal is now broken — at least in the value tier.

The shift happened fast and almost entirely through Chinese labs. DeepSeek's V4 architecture proved that mixture-of-experts training at scale could deliver frontier-adjacent quality at a fraction of the compute cost. Once that paper dropped, every serious lab either replicated the technique or fell behind. MiniMax, Moonshot AI, Zhipu AI (the team behind GLM), and Xiaomi all shipped competitive models within the same six-month window. The result: by mid-2026, if you sort this site's best-value ranking by intelligence-per-dollar, the top tier is almost entirely open-weight models from Chinese labs.

"Open-weight" here means the model weights are downloadable and self-hostable. But in practice, the majority of developers consume these models through API endpoints — either the lab's own cloud or a routing layer like OpenRouter — because self-hosting at anything below very high sustained throughput is still operationally expensive. The open-weight label matters most as a price signal: these labs don't need to recoup a $100M training run through margin, so they price API access more aggressively than closed US providers can afford to.

For context: GPT-5.4 sits at $2.50/$15 per 1M tokens (input/output) with an Artificial Analysis Intelligence Index (AA) score of 51.4. Claude Opus 4.8 costs $5/$25 with AA 55.7. The open-weight models we're examining here reach roughly 72–99% of those capability scores at input prices roughly 2× to 55× lower. That gap is not noise; it's the central fact of the 2026 API market.
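To see where a specific pair lands, the arithmetic is easy to reproduce. The sketch below is a minimal illustration that uses only the input prices and AA scores quoted in this article. The dictionary names and the compare helper are made up for this example, and comparing input price alone ignores output pricing.

```python
# Figures copied from this article: (input USD per 1M tokens, AA Intelligence Index).
OPEN_WEIGHT = {
    "DeepSeek V4 Flash": (0.09, 40.3),
    "MiniMax M3": (0.30, 44.4),
    "DeepSeek V4 Pro": (0.435, 44.3),
    "Kimi K2.6": (0.67, 42.8),
    "GLM 5.2": (1.20, 51.1),
}
REFERENCES = {
    "GPT-5.4": (2.50, 51.4),
    "Claude Opus 4.8": (5.00, 55.7),
}


def compare(model, reference):
    """Return (capability fraction, input price multiple) of model vs reference."""
    price, aa = OPEN_WEIGHT[model]
    ref_price, ref_aa = REFERENCES[reference]
    return aa / ref_aa, ref_price / price


for ref in REFERENCES:
    for name in OPEN_WEIGHT:
        frac, mult = compare(name, ref)
        print(f"{name:18s} vs {ref:16s}: {frac:5.1%} of AA, {mult:5.1f}x cheaper input")
```

The spread is wide: GLM 5.2 reaches about 99% of GPT-5.4's AA at roughly 2× lower input price, while V4 Flash reaches about 72% of Claude Opus 4.8's AA at more than 50× lower input price.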

2. The four contenders, one by one

These four model families don't compete head-to-head across every dimension (DeepSeek V4 appears below in both its Flash and Pro variants). Each has carved a distinct niche, and understanding that niche is the first step to picking the right one.

DeepSeek V4 Flash ($0.09/$0.18, AA 40.3, 1M context) is the cheapest "actually decent" model on this entire site. The AA score of 40.3 puts it meaningfully below the others in this comparison, but it's not a toy. For classification, structured extraction, simple Q&A, RAG retrieval, and any workload where you're running millions of requests and cost is the binding constraint, V4 Flash is the obvious starting point. The $0.09 input price is so low that even moderate quality is cost-effective at scale.

DeepSeek V4 Pro ($0.435/$0.87, AA 44.3, coding 47.5, 1M context) steps up to the mid-range. The AA score is 10% higher than Flash, the coding index sits at 47.5, and the price is still about 83% below GPT-5.4 on input and 94% below on output. V4 Pro is the natural choice when you've stress-tested Flash on your workload and found the quality gap tangible, particularly for multi-hop reasoning, structured code generation, or tasks where output quality directly translates to downstream value.

MiniMax M3 ($0.30/$1.20, AA 44.4, coding 43.4, 1M context) is the sleeper pick of this comparison. The AA score is neck-and-neck with V4 Pro despite being 30% cheaper on input. Its coding index is slightly lower than V4 Pro's (43.4 vs 47.5), but the combination of a true 1M context window, strong general capability, and that price point makes it the default recommendation for long-context workloads — document analysis, codebase Q&A over large repos, long-form summarization chains — where you need to fit a lot of tokens per call without the bill exploding.

Z.ai GLM 5.2 ($1.20/$4.20, AA 51.1, coding 68.8, 1M context) is the capability leader of this group. The AA score of 51.1 is within striking distance of GPT-5.4 (51.4) — essentially the same general intelligence tier. But the headline number is the coding index: 68.8, which is not just the highest in this group but higher than most US flagship models. If coding quality is the axis that matters for your workload, GLM 5.2 at $1.20/$4.20 is a genuinely disruptive option against GPT-5.4 at $2.50/$15.

MoonshotAI Kimi K2.6 ($0.67/$3.50, AA 42.8, coding 47.1, 262K context) occupies a different kind of niche. The AA score is the lowest in the group after Flash, but Kimi's architecture and training have been specifically tuned for agentic workflows: multi-step task execution, tool use, browser automation. Moonshot also offers Kimi K2.7 Code ($0.74/$3.50, coding 45.6) for workloads where coding matters more than agentic breadth, though its coding index sits slightly below K2.6's 47.1, so benchmark both before switching. The 262K context window is the only hard constraint; if your task fits inside it, Kimi is worth testing for agent pipelines before defaulting to a pricier option.

3. Price vs capability map

The table below puts the eight open-weight models discussed in this article, plus two reference models, on the same axes. "Input" and "Output" are USD per 1M tokens. "AA" is the Artificial Analysis Intelligence Index (0–100 scale, current maximum approximately 60). "Cod" is the AA Coding Index. "Ctx" is max context window. The two US premium models appear at the bottom as reference points.

Model	Input $/M	Output $/M	AA	Cod	Ctx
DeepSeek V4 Flash	$0.09	$0.18	40.3	—	1M
Xiaomi MiMo-V2.5	$0.14	$0.28	40.1	42.1	1M
MiniMax M3	$0.30	$1.20	44.4	43.4	1M
DeepSeek V4 Pro	$0.435	$0.87	44.3	47.5	1M
Kimi K2.6	$0.67	$3.50	42.8	47.1	262K
Kimi K2.7 Code	$0.74	$3.50	—	45.6	262K
Z.ai GLM 5.2	$1.20	$4.20	51.1	68.8	1M
Qwen3.7 Max	$1.25	$3.75	46.0	50.1	1M
GPT-5.4 (ref)	$2.50	$15.00	51.4	—	—
Claude Opus 4.8 (ref)	$5.00	$25.00	55.7	—	—

Two things jump out of this table. First, MiniMax M3 and DeepSeek V4 Pro have almost identical AA scores (44.4 vs 44.3) despite M3 being 30% cheaper on input. The output pricing tilts the other way — M3's $1.20 output is 38% higher than V4 Pro's $0.87 — so the winner depends on your input/output ratio. If you're doing long-context reads with short answers (RAG, summarization), M3 wins. If you're generating long outputs from short prompts (code generation, drafting), V4 Pro wins.
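The break-even point can be worked out directly. This is a rough sketch, assuming cost is linear in tokens at the listed prices (no caching discounts or volume tiers). The function names are illustrative and not part of any provider SDK.

```python
# Prices in USD per 1M tokens as (input, output), taken from the table above.
M3 = (0.30, 1.20)
V4_PRO = (0.435, 0.87)


def cost(prices, input_tokens, output_tokens):
    """USD cost of one call given (input, output) prices per 1M tokens."""
    price_in, price_out = prices
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def break_even_ratio(a, b):
    """Input:output token ratio at which models a and b cost the same.

    Solves a_in*I + a_out*O == b_in*I + b_out*O for I/O.
    """
    return (a[1] - b[1]) / (b[0] - a[0])


ratio = break_even_ratio(M3, V4_PRO)
print(f"M3 is cheaper when input:output exceeds about {ratio:.2f}:1")
```

With the listed prices the ratio comes out near 2.44:1. A RAG call with 100K tokens in and 1K out lands on the M3 side; a code-generation call with 1K in and 4K out lands on the V4 Pro side.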

Second, GLM 5.2 at $1.20 input delivers AA 51.1, basically the same general intelligence as GPT-5.4 at $2.50 input, with a dramatically lower output price ($4.20 vs $15). For API workloads, GPT-5.4 is worth considering over GLM 5.2 only if you have hard requirements around OpenAI compliance, specific tool-use behaviors, or ecosystem integrations that don't yet support non-OpenAI endpoints. Capability is not the reason.

4. Coding vs agentic: who wins which job

The AA Coding Index measures performance on programming benchmarks: code completion, debugging, algorithm implementation, test generation. High coding scores don't automatically predict agentic performance — the ability to orchestrate multi-step plans, use tools reliably, and recover from errors is a different skill that can diverge significantly from raw code quality.

For pure coding tasks — autocomplete, code review, test generation, bug fixing in isolated files — GLM 5.2's coding index of 68.8 is the clearest signal. That score is not just the highest in this group; it's higher than Qwen3.7 Max (50.1), higher than either Kimi variant (47.1/45.6), and higher than V4 Pro (47.5). If the quality of generated code is what matters most, GLM 5.2 is the answer even at its premium price — because you're still paying roughly 50% of GPT-5.4's input rate and 28% of its output rate. Use the cost calculator to quantify exactly how much you'd save at your monthly token volume.
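For a quick estimate before opening the calculator, the same arithmetic fits in a few lines. This is a sketch, not the site's calculator: the monthly volume below is hypothetical, and only the per-token prices come from this article.

```python
def monthly_cost(price_in, price_out, input_tokens, output_tokens):
    """USD per month; prices are USD per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical monthly volume, chosen only for illustration.
INPUT_TOKENS = 2_000_000_000   # 2B input tokens
OUTPUT_TOKENS = 500_000_000    # 500M output tokens

glm = monthly_cost(1.20, 4.20, INPUT_TOKENS, OUTPUT_TOKENS)
gpt = monthly_cost(2.50, 15.00, INPUT_TOKENS, OUTPUT_TOKENS)
saving = 1 - glm / gpt
print(f"GLM 5.2: ${glm:,.0f}  GPT-5.4: ${gpt:,.0f}  saving: {saving:.0%}")
```

At that made-up volume the bill is about $4,500 on GLM 5.2 against $12,500 on GPT-5.4, a saving of 64%. The more output-heavy your mix, the larger the saving, because the output price gap is wider than the input gap.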

For agentic workloads — browser automation, multi-step research pipelines, tool-use chains, autonomous task execution — the picture is more nuanced. Kimi K2.6's architecture is explicitly tuned for this pattern. Its AA score of 42.8 is not impressive in isolation, but Moonshot has published benchmark results showing strong performance specifically on agent-relevant tasks: following long instruction chains, recovering from tool errors, maintaining state across many turns. If you're building an agent that needs to interact with external systems through a tool interface, Kimi is worth benchmarking even if its general intelligence score looks mediocre by comparison.

Qwen3.7 Max ($1.25/$3.75, AA 46.0, coding 50.1) deserves mention here as an alternative to GLM 5.2 for coding tasks. Its coding index is 27% lower than GLM 5.2's, a meaningful gap, but it's still strong enough for many practical workloads and costs nearly the same on input ($1.25 vs $1.20). On capability there is little to decide: GLM 5.2 leads on both axes, with higher AA (51.1 vs 46.0) and a far higher coding index (68.8 vs 50.1) at almost the same input price, which makes it the default choice unless output cost is the binding constraint. Qwen's only price edge is on output ($3.75 vs $4.20), so it earns its place mainly on Alibaba ecosystem reach; its 1M context window matches GLM 5.2's rather than beating it.

Xiaomi MiMo-V2.5 ($0.14/$0.28, AA 40.1, coding 42.1) rounds out the field as a budget option for coding tasks. The coding index isn't competitive with GLM or even Kimi, but for code-adjacent tasks at very high volume — linting suggestions, boilerplate generation, regex construction — it's worth a look before defaulting to DeepSeek V4 Flash.

5. Context window and deployment flexibility

Every model in this group except Kimi K2.6 and K2.7 Code ships with a 1M-token context window. That's enough to fit entire codebases, very long document corpora, or extended conversation histories without chunking. Kimi's 262K window is still generous by most historical standards, but it does become a real constraint for the class of workloads (large-repo code understanding, full-book analysis) where 1M matters.

On API access: all of the open-weight models covered here are available via their labs' own APIs and most are routable through OpenRouter. The compare tool on this site lets you put any two side by side with live pricing. Latency varies by provider region, traffic load, and model quantization, so benchmark your specific workload before committing to one endpoint.

On self-hosting: the weights for DeepSeek V4 variants and some GLM releases are publicly available. The self-host break-even is approximately 50–100M tokens per day in sustained throughput; below that, GPU cluster costs (hardware amortization, power, ops labor) exceed API fees for any of these models. Most teams running less than that should stay on managed APIs. The operational headaches of model updates, quantization decisions, and uptime management are real costs that rarely appear in the back-of-envelope math.
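The break-even logic itself is simple enough to sketch. The version below treats self-hosting as a fixed daily cost and compares it with API fees at a blended per-token price. Both the input/output mix and the daily cost are placeholder assumptions, not figures from this article, and the model ignores everything that scales with volume on the self-hosted side.

```python
def blended_price(price_in, price_out, input_share):
    """Average USD per 1M tokens for a given share of input tokens (0..1)."""
    return price_in * input_share + price_out * (1 - input_share)


def break_even_tokens_per_day(daily_self_host_cost, api_price_per_million):
    """Daily tokens above which a fixed self-host cost beats API fees."""
    return daily_self_host_cost / api_price_per_million * 1_000_000


# ASSUMPTIONS, not figures from the article: a 75/25 input/output mix and a
# placeholder all-in self-host cost. Replace both with your own numbers.
INPUT_SHARE = 0.75
DAILY_SELF_HOST_COST = 150.0  # USD per day, hypothetical

api_price = blended_price(1.20, 4.20, INPUT_SHARE)  # GLM 5.2 list prices
tokens = break_even_tokens_per_day(DAILY_SELF_HOST_COST, api_price)
print(f"Break-even at roughly {tokens / 1e6:.0f}M tokens/day")
```

The useful part is the shape of the formula: the break-even volume scales linearly with your real all-in daily cost and inversely with the API price you would otherwise pay, so cheaper APIs push the break-even further out.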

MiniMax M3's weights are not as widely distributed as DeepSeek's, making it more API-first. Kimi K2.6 is currently API-only. GLM 5.2 has both managed API and weight download options, though the quantized versions available for consumer hardware show measurable quality degradation relative to the hosted version.

6. Verdict: pick by scenario

Rather than declaring a single winner — which would be wrong, because these models genuinely suit different use cases — here's the decision tree. Use the best-value ranking and calculator to check the numbers for your specific token volume.

Lowest cost, adequate quality (classification, extraction, RAG retrieval at scale): Start with DeepSeek V4 Flash at $0.09/$0.18. It's the cheapest model on the site that still passes a quality bar for production use. If Flash fails your quality test, move to MiniMax M3 before V4 Pro: M3 matches V4 Pro's AA (44.4 vs 44.3), both well above Flash's 40.3, and it costs less whenever your workload is input-heavy.

Long-context workloads (document analysis, large-repo Q&A, book summarization): MiniMax M3 ($0.30/$1.20). True 1M context, AA 44.4, and a price low enough that a 500K-token call costs about $0.15 in input. If you need the full 1M window on every call and output quality matters more than cost, GLM 5.2 is the upgrade.

Coding quality is the primary axis: GLM 5.2 ($1.20/$4.20, coding 68.8). No open-weight alternative in this group comes close on the coding benchmark. At roughly half the input price and 28% of the output price of GPT-5.4, the value case is clear.

Agentic pipelines and multi-step tool use: Kimi K2.6 ($0.67/$3.50) is worth testing first if your context fits in 262K. If you need a 1M window for your agent context, MiniMax M3 is the next option before jumping to GLM 5.2.

When to pay for US flagship models: The capability gap between GLM 5.2 (AA 51.1) and GPT-5.4 (AA 51.4) is essentially noise. Claude Opus 4.8's AA of 55.7 is a real gap: roughly 8–9% above both on the index. If your task requires that last 8% and you've verified it on your own evaluation set, pay for it. Otherwise, the open-weight value tier wins.
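The decision tree above can be written down as a small lookup, which is handy as a starting default in a routing layer. This is only a sketch of this article's recommendations: the priority labels and the function name are invented here, and the 262K limit is taken as 262,000 tokens.

```python
# Kimi K2.6's window as listed in this article (262K); the exact limit may differ.
KIMI_CONTEXT_TOKENS = 262_000


def pick_model(priority, context_tokens=0):
    """Return the first model to test for a workload, per this article.

    priority: "cost", "long_context", "coding", "agentic", or "max_capability".
    context_tokens: largest prompt you expect to send in a single call.
    """
    if priority == "cost":
        return "DeepSeek V4 Flash"
    if priority == "long_context":
        return "MiniMax M3"
    if priority == "coding":
        return "GLM 5.2"
    if priority == "agentic":
        # Kimi only qualifies when the agent context fits its window.
        if context_tokens <= KIMI_CONTEXT_TOKENS:
            return "Kimi K2.6"
        return "MiniMax M3"
    if priority == "max_capability":
        return "Claude Opus 4.8"
    raise ValueError(f"unknown priority: {priority!r}")


print(pick_model("agentic", context_tokens=100_000))
print(pick_model("agentic", context_tokens=800_000))
```

Treat the return value as the first candidate to benchmark, not a final answer; the upgrade paths described above (M3 to GLM 5.2, Flash to M3) still apply when the first pick fails your own evaluation.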

Written by Allen Pan. Corrections or questions welcome — allen@xyzsleep.com.