Which topics does this article cover?

It highlights open source LLM, coding AI, GLM-5.1, MiniMax M3, Kimi K2.6.

The Best Open Source LLMs for Coding Right Now (June 2026)

Q: What is "The Best Open Source LLMs for Coding Right Now (June 2026)" about?

The open-source coding LLM leaderboard looked completely different in April than it does today. MiniMax M3 just shipped June 1st. GLM-5.1 landed in April with an 8-hour autonomous execution claim nobody expected. Here's the real picture as of June 2026.

The leaderboard moved — again. Between April and June 2026, at least five major open-weight coding models shipped, two of them from labs most Western developers haven't heard of. If you read a "best open-source LLMs" guide from three months ago, it's already wrong.

This post is current as of June 8, 2026. Every benchmark number below has a source. Where benchmarks are self-reported by labs (which most are — we'll get into that), we say so.

First: Stop Trusting HumanEval Scores

Everyone above 85% on HumanEval can be ignored for ranking purposes. That includes Qwen, DeepSeek, Codestral, Llama — all of them cross that threshold now. The benchmark is saturated, and there's strong evidence of training data contamination across the board.

The numbers that actually discriminate in 2026:

SWE-bench Verified / SWE-bench Pro — Given a real GitHub repository and a bug report, does the model write a patch that makes the tests pass? SWE-bench Verified (500 tasks) is becoming contaminated itself; SWE-bench Pro (1,865 multi-language tasks) is the harder, cleaner signal right now. OpenAI officially stopped reporting SWE-bench Verified scores in early 2026, citing contamination concerns.

Terminal-Bench 2.1 — Long-horizon CLI tasks: scripting, DevOps automation, multi-step workflows in a terminal. Harder to game than single-function benchmarks.

LiveCodeBench — Competitive programming problems pulled from Codeforces, LeetCode, AtCoder continuously. Contamination-resistant because the dataset updates monthly.

FIM pass@1 — For autocomplete specifically. How often does the model correctly fill in code between a prefix and suffix? This is what your Tab key actually calls.

Agentic Coding (LiveBench) — Multi-step task completion in a coding context. The most useful proxy for "can this model actually work autonomously on a repo?"

With that baseline established — here's the actual landscape.

The June 2026 Leaderboard at a Glance

Model	SWE-bench Pro	Terminal-Bench 2.1	License	Best For
MiniMax M3	59.0%	66.0%	Open-weight (weights pending)	Frontier coding + 1M context
GLM-5.1	58.4%	63.5–66.5%	MIT	Long-horizon agentic engineering
Kimi K2.6	58.6%	—	Modified MIT	Agent swarms, autonomous runs
DeepSeek V4 Pro	~56%	—	MIT	API cost play, 1M context
Qwen3-Coder 480B	~55% (SWE-V: 67–70%)	—	Apache 2.0	Agentic coding via API
Qwen3.6-35B-A3B	73.4% SWE-V	51.5% Terminal	Apache 2.0	Best local model, single GPU
Devstral Small 24B	—	—	Apache 2.0	Local agentic workflows
Codestral 22B	—	FIM: 95.3%	Mistral	IDE autocomplete

Self-reported benchmarks from labs are noted where applicable. SWE-bench Pro scores use standardized scaffolding. Numbers current as of June 2026.

1. MiniMax M3 — The New Entrant You Can't Ignore (Released June 1, 2026)

TL;DR: One week old as of this writing. Claims the top open-weight SWE-bench Pro score. API is live. Weights not yet public. Treat with calibrated skepticism, but the architecture is legitimately interesting.

MiniMax dropped M3 on June 1, 2026 — seven days ago. The API is live through MiniMax's platform and OpenRouter. Open weights and the technical report were promised within ten days of launch, which means they should appear on Hugging Face around June 11.

The numbers MiniMax is claiming: 59.0% on SWE-bench Pro (which they say beats GPT-5.5 and Gemini 3.1 Pro), 66.0% on Terminal-Bench 2.1, 70.06% on OSWorld-Verified for computer use. The model supports a 1M-token context window and is natively multimodal — text, image, and video input from a single architecture.

The architecture story is actually compelling. MiniMax built MSA (MiniMax Sparse Attention), which they say delivers more than 9× faster prefill and more than 15× faster decoding at 1M-token context compared to M2, at 1/20th the per-token compute. Standard full attention has quadratic cost as sequence length grows — MSA partitions the KV cache into blocks and uses a "KV outer gather Q" approach where each KV block is read exactly once, with contiguous memory access. If those speedup numbers hold under independent testing, this matters enormously for long-context coding tasks.

What you need to know right now:

Every benchmark number from M3 is vendor-reported. MiniMax ran their own evaluations using Claude Code as the scaffolding. Independent confirmation hasn't landed yet. TechTimes noted that MiniMax's comparison baseline uses Claude Opus 4.7 rather than the more recently released Opus 4.8 — which places M3 further from the frontier than the launch announcement implies.

Additionally: MiniMax is a Shanghai-based lab operating under Chinese law. If data sovereignty matters to your organization, that's a real consideration before routing production coding traffic through their API.

Promo API pricing at launch: ~$0.30/M input, $1.20/M output (50% promotional discount on the standard $0.60/$2.40 rate).

Bottom line: If you're curious and cost-tolerant about experimental models, M3 via API is worth testing right now. Don't make architectural decisions based on it until the weights ship and independent evals come in.

2. GLM-5.1 — The 8-Hour Model (Released April 2026)

TL;DR: 754B MoE, MIT license, 58.4 on SWE-bench Pro, 200K context. Built specifically for tasks that take hours, not minutes. The single most interesting architectural claim this quarter.

Z.AI (formerly Zhipu AI) released GLM-5.1 in April 2026, and it's been quietly sitting at or near the top of the open-weight coding leaderboard ever since. It achieves 58.4 on SWE-bench Pro, outperforming GPT-5.4 and Claude Opus 4.6 on that specific benchmark (self-reported, but corroborated by LiveBench data).

The capability Z.AI is specifically calling out — and which appears to be real, not just marketing — is long-horizon autonomous execution. The model is designed to work continuously on a single complex task for up to 8 hours. Most LLMs are implicitly optimized for single-turn interactions: give it a clear question, get a clean answer. GLM-5.1 is designed for something harder: multi-stage workflows where the model has to plan, execute dozens of dependent steps, encounter failures, course-correct, and deliver production-grade results.

Z.AI documented the model building a complete Linux desktop system from scratch within 8 hours as a demonstration case. Whether that's reproducible in your specific engineering context is a different question, but the underlying capability — sustained execution without degrading into repetitive loops — is architecturally distinct from what other models do.

The benchmark picture:

SWE-bench Pro: 58.4 (SOTA among open-weight at time of release)
AIME 2026: 95.3
GPQA-Diamond: 86.2
Terminal-Bench 2.0: 63.5 (66.5 with Claude Code scaffolding)
MCP-Atlas (Public Set): 71.8 — directly relevant as MCP becomes standard in production agent systems
CyberGym: 68.7 (up from GLM-5's 48.3 — significant jump)

The 754B MoE architecture is MIT licensed on Hugging Face. For API access, it's available through Z.AI's platform, SiliconFlow, and OpenRouter. Local deployment is supported via SGLang (v0.5.10+), vLLM (v0.19.0+), and KTransformers.

One thing worth noting: The model aligns closely with Claude Opus 4.6 on general intelligence benchmarks. It's not replacing frontier proprietary models — it's matching the previous generation of them for free, which is the actual win.

Bottom line: If you're building agents that need to run long, complex jobs autonomously — not just answer a question and wait for the next prompt — GLM-5.1 is the most purpose-built open-weight model for that use case. The 8-hour claim is unusual enough to investigate.

3. Kimi K2.6 — Best Overall Local Coding Model (If You Have the Hardware)

TL;DR: 1T total / 32B active MoE. 58.6 on SWE-bench Pro. preserve_thinking mode maintains reasoning state across turns. Best-in-class for agent workflows running locally.

Moonshot AI's Kimi K2.6 sits at the top of the May 2026 LiveBench snapshot across both key coding metrics: 78.57 Coding Average and 58.33 Agentic Coding Average. On SWE-bench Pro, it hits 58.6. It's the strongest open-weight model you can actually run locally — with the hardware caveat we'll get to.

The architectural detail that matters for real usage: K2.6 introduces preserve_thinking mode, which maintains full reasoning traces across conversation turns. For complex debugging sessions that span multiple messages, most models effectively forget what they reasoned through three turns ago. K2.6 with preserve_thinking maintains consistent reasoning state — which means less re-explaining context, more coherent multi-step diagnosis. This is a genuine quality-of-life improvement for anyone using the model in an interactive engineering session.

K2.6 also introduces agent swarm orchestration — the ability to coordinate multiple sub-agents in parallel. If you're building a coding pipeline where different agents handle different parts of a codebase simultaneously, this is the model with native architecture support for it.

ollama pull kimi-k2.6

Hardware reality (don't skip this):

K2.6 is a 1-trillion total parameter MoE model. For consumer hardware, you need quantization, and even then you need serious memory. On an M4 Ultra Mac Studio with 128GB unified memory, it runs. On dual RTX 4090s (48GB VRAM), quantized versions work. On a single 24GB card — it doesn't. Don't fight it.

If you have the hardware, K2.6 is the current best local coding model. If you don't, read the next section.

Bottom line: Modified MIT license. HuggingFace weights are available. For teams with multi-GPU setups or M-series Macs with large unified memory, this is the model to run locally. For everyone else, consider it as an API target rather than a local model.

4. Qwen3.6-35B-A3B — The Single-GPU King (Released April 16, 2026)

TL;DR: Apache 2.0. 262K context. 73.4% SWE-bench Verified. 3B active parameters out of 35B total. Runs on a 24GB card or M-series Mac with 32GB. This is the model for most developers.

While the headlines went to the trillion-parameter monsters, Alibaba quietly released Qwen3.6-35B-A3B on April 16, 2026, and it's arguably the most practically useful open coding model for the median developer.

The MoE architecture activates only 3B parameters per token out of 35B total, which means inference cost is close to running a 3B model while drawing on the full 35B parameter space. The result: it fits on consumer hardware, runs at usable speeds, and punches far above its size class on coding benchmarks.

73.4% on SWE-bench Verified — from a model that fits on a single RTX 4090 or M-series Mac with 32GB. That number is competitive with what large cloud models were scoring 12 months ago.

Additional benchmark context:

Terminal-Bench 2.0: 51.5% — strong for its hardware tier
262K native context, extendable to ~1M via Yarn
Thinking preservation (new in this release): retains reasoning context from historical messages, reducing overhead in iterative development sessions
Designed specifically for agentic coding — repository-level reasoning, tool calling, multi-step workflows

# Ollama - Mac and Linux
ollama run qwen3.6:35b-a3b

# Unsloth GGUF - for llama.cpp style inference
# Available on HuggingFace under Qwen/Qwen3.6-35B-A3B-GGUF

The Apache 2.0 license is the cleanest available — no commercial restrictions, no weird carveouts. You can fine-tune it, deploy it in a product, and modify the weights without asking Alibaba for permission.

Who this is for: This is the default recommendation for individual developers and small teams who want a serious local coding model without needing enterprise GPU infrastructure. The hardware bar is a single 24GB card or an M-series Mac. The quality bar is frontier-adjacent.

Bottom line: If you have a single 4090 or a Mac with 32GB+, start here. The tradeoff between performance and hardware requirement is better than any other model on this list.

5. Qwen3-Coder 480B-A35B — The Agentic Powerhouse (Via API)

TL;DR: 480B total / 35B active MoE. Comparable to Claude Sonnet 4 on agentic coding benchmarks. 256K native context, 1M with extrapolation. Free on OpenRouter right now. Run via API, not locally.

Alibaba's Qwen3-Coder, released July 2025, has become the default open-weight model for teams running heavy agentic coding workflows via API. The 480B-A35B-Instruct variant is Alibaba's most powerful agentic coder to date and is specifically designed for the kind of work modern coding agents do: function calling, tool use, and long-context reasoning over entire repositories.

On agentic coding, agentic browser-use, and agentic tool-use benchmarks, Qwen3-Coder sets new state-of-the-art results among open models, described as comparable to Claude Sonnet 4. The 256K native context (extendable to 1M) is appropriate for real-world repositories — not just toy examples.

Alibaba also released Qwen Code alongside the model — a command-line tool for agentic coding forked from Gemini Code, adapted with custom prompts and function calling protocols specifically tuned for Qwen3-Coder.

# Configure in Cline:
# API Provider: OpenAI Compatible
# API Key: your-openrouter-key
# Base URL: https://openrouter.ai/api/v1
# Model: qwen/qwen3-coder-480b-a35b-instruct

At $0.22/M input and $1.00/M output on standard OpenRouter pricing — with a free tier currently available — this is one of the most economical options for high-volume agentic coding pipelines. You're not going to self-host 480B parameters; that requires H100-class multi-GPU infrastructure. The value proposition here is entirely API-based.

Bottom line: For teams using coding agents at scale and optimizing for cost, Qwen3-Coder 480B via API is the obvious choice. The free tier on OpenRouter is worth exploring before committing to paid plans.

6. DeepSeek V4 Flash — Self-Hosting the Frontier (Released April 24, 2026)

TL;DR: MIT license. Two variants: V4-Flash (284B, ~158GB) is self-hostable on 4× A100 or 2× H200. V4-Pro (1.6T) needs a real cluster. Flash delivers 85–95% of Pro quality. Best option if you need frontier-quality code on your own infrastructure.

DeepSeek's V4 released April 24, 2026, with two variants designed for different infrastructure profiles. V4-Pro is the flagship at 1.6T total parameters (49B active); V4-Flash is the practical self-hosting target at 284B total (with FP4+FP8 mixed precision, checkpoint weights ~158GB).

The headline for DeepSeek V4 is the KV cache efficiency: V4 uses only 7% of V3.2's KV cache footprint. That's not a rounding error — it's an architectural improvement that makes serving 1M-context inputs practically feasible without absurd memory requirements.

V4-Flash at Q4 quantization fits on 4× A100 80GB or 2× H200 141GB, plus ~256GB system RAM. The hardware math: weights are ~158GB, full 1M-token KV cache adds ~10GB, runtime overhead a few GB more — total ~170-175GB VRAM. Four A100s give you 320GB, which is headroom, not requirement.

Why self-host V4 at all?

The calculus has shifted in 2026. Regulatory frameworks increasingly demand data residency guarantees. For EU healthcare data, US government contracting, or any organization under strict data sovereignty requirements, "send your code to a third-party API" is not a compliant architecture. V4 Flash under MIT license, running on your own H200 nodes, means no compliance conversation needed.

If you're spending less than ~$15,000/month on inference, the API is almost certainly cheaper than owning the hardware. Above that threshold, self-hosting math starts to work.

# vLLM deployment - requires 4x A100 80GB minimum
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

Bottom line: For individual developers, use the DeepSeek API — it's still one of the best cost-per-quality deals available. For organizations with data sovereignty requirements and the GPU budget, V4 Flash self-hosted is the only serious open-weight frontier model that's actually deployable.

7. Codestral 22B — Still the Only Answer for IDE Autocomplete

TL;DR: 95.3% FIM pass@1. 22B fits on consumer hardware. Co-founder of Continue.dev says it's their recommended autocomplete model. Nothing else competes on this specific task.

Nothing on this list has changed Codestral's position for IDE tab completion. Mistral built Codestral specifically for fill-in-the-middle (FIM), which is what actually happens when you hit Tab in VS Code or Neovim — the model sees code before your cursor, code after your cursor, and fills the gap.

Codestral's 95.3% FIM pass@1 (Codestral 25.01) outperforms DeepSeek Coder V2 (83.5%) and Llama 3 70B (81.7%) on the same tasks. The gap isn't small. Ty Dunn, co-founder of Continue.dev, publicly called it their recommended autocomplete model because "code completion constitutes a large portion of the work, which requires models that are great at fill-in-the-middle."

The 22B size is intentional — small enough to deliver sub-100ms completions locally, large enough to maintain quality. That latency matters: autocomplete that takes 500ms breaks developer flow. Codestral at Q4_K_M runs ~14GB and fits comfortably on an RTX 3080 Ti or better.

// Continue.dev config
{
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "codestral:22b",
    "title": "Codestral"
  }
}

ollama pull codestral:22b

Alongside Codestral for autocomplete, Devstral Small 24B (Apache 2.0) remains the best local option for multi-file agentic workflows — multi-file edits, debugging loops, repository-level reasoning. The two are designed to work together: Codestral handles your Tab key, Devstral handles your agent panel.

Bottom line: If your primary use case is IDE autocomplete — and for most developers it is — Codestral 22B is the answer. It was true in early 2026 and it's still true today. Nothing released in the past two months has changed that.

The Hardware Decision Tree (June 2026 Edition)

Be honest about your hardware before you pick a model. Swapping to SSD isn't a model limitation — it's a sizing problem.

8GB VRAM (RTX 4060, laptop GPUs): Qwen3 8B. Competitive programming assistant and code generation, not agentic repo work.

12–16GB VRAM (RTX 3080, RTX 4070): Qwen2.5-Coder 14B or Devstral Small 24B (quantized). Codestral 22B for autocomplete runs fine here too.

24GB VRAM (RTX 3090, RTX 4090, A5000): Qwen3.6-35B-A3B — this is your sweet spot. The 3B active parameter MoE architecture makes it inference-efficient while the 35B total gives you serious quality. Codestral 22B for autocomplete alongside it.

32GB unified memory (Mac M3/M4 Pro, Mac Mini M4 with 32GB): Qwen3.6-35B-A3B at full quality. GGUF via llama.cpp or Ollama.

48–80GB VRAM (dual 4090, A100 40GB, A6000): Kimi-Dev-72B (the original SWE-bench champion at 60.4%) or start looking at Kimi K2.6 quantized.

128GB+ unified (Mac Studio M4 Ultra, Mac Pro M4 Ultra): Kimi K2.6 quantized, GLM-5.1 quantized via KTransformers.

Enterprise GPU cluster (4× A100, 2× H200, 8× H100): DeepSeek V4 Flash or GLM-5.1 full precision. MiniMax M3 once weights ship.

The Practical Stack for June 2026

Stop looking for one model to rule everything. The models that win on agentic coding are not the same ones that win on real-time autocomplete. Here's what a practical setup looks like:

IDE tab completion: Codestral 22B (Ollama + Continue.dev)

Local chat + code review: Qwen3.6-35B-A3B (single 4090 or M-series Mac)

Local agent for complex multi-file work: Devstral Small 24B (via OpenHands or Cline)

API — high-volume agentic pipelines: Qwen3-Coder 480B via OpenRouter (free tier available)

API — best quality, cost-conscious: DeepSeek V4 Pro API (~competitive pricing, MIT model)

Long-horizon autonomous execution: GLM-5.1 via Z.AI API or SiliconFlow

Experimental — early access: MiniMax M3 API (weights pending, verify benchmarks yourself)

What Benchmarks Still Don't Tell You

Three things that don't appear in any leaderboard:

Hallucinated package APIs. Some models confidently call functions with signatures that changed two years ago, or import packages that don't exist in the version your code targets. GLM-5.1 and Qwen3-Coder are relatively strong here; smaller models are worse. Test your specific stack.

Test-passing vs. code-that-looks-right. Kimi-Dev-72B (the predecessor to K2.6) was trained with RL rewards that fire only when test suites pass — not when code compiles, not when it looks plausible. Most other models are still trained on next-token prediction over code that humans wrote, much of which is subtly wrong. The training objective difference is real and shows up when models handle edge cases.

Context degradation over long sessions. GLM-5.1's "8-hour execution" claim is partly an architecture story about not losing coherence over extended sessions. Other models — even with large context windows — tend to produce lower-quality outputs when the context fills up with long conversation history. If your use case involves long autonomous runs, test that specifically, not just single-turn benchmark scores.

The Honest Bottom Line

Open source coding AI in June 2026 is genuinely frontier-competitive. The specific phrasing that was accurate in 2024 — "open source is good enough for most tasks but you'll need proprietary for the hard stuff" — is no longer accurate. GLM-5.1, MiniMax M3, and Kimi K2.6 are competing directly with the previous generation of proprietary frontier models.

That said, "competitive benchmarks" and "production-ready for your specific codebase" are different things. Run the model on your actual tasks, not just leaderboard tasks. The hallucination patterns, context handling, and latency characteristics vary enormously by use case.

The models that aren't on leaderboards — the hardware requirements, the data sovereignty implications, the licensing edge cases — are as important as the benchmark numbers. This guide tries to be honest about both.

Resources

LiveBench Leaderboard — current open-source model rankings with coding and agentic coding breakdowns
SWE-bench.com Viewer — full history of SWE-bench submissions
SWE-rebench Leaderboard — standardized re-evaluation of SWE-bench across models
Kilo.ai Open Source Model Rankings — live leaderboard, updated as models ship
Qwen3-Coder GitHub — weights, Qwen Code CLI, agentic coding tooling
GLM-5.1 on Hugging Face — MIT-licensed weights and deployment docs
Kimi-Dev GitHub — original RL-trained coding model, still relevant for SWE-bench workflows
MiniMax M3 Blog — official technical post (read the caveats section)
Continue.dev — open-source VS Code + JetBrains extension for local LLMs
OpenHands (All Hands AI) — open-source coding agent compatible with Devstral, Kimi, Qwen
Ollama — local model serving with one-line model pulls
OpenRouter — unified API routing for Qwen3-Coder, MiniMax M3, and others