LLMs & Generative AI

Kimi K2.7 Code Launches as an Open Model for Long-Horizon Coding

Moonshot AI has released Kimi K2.7 Code, an open-weight model for long-horizon software engineering. We examine its benchmarks, architecture, API price, license and limitations.

By 10 min read
Laptop screen showing source code, illustrating the Kimi K2.7 Code programming model

Moonshot AI has released Kimi K2.7 Code, a new open-weight model built primarily for agentic coding and long-running engineering tasks. The weights appeared in Moonshot AI's official Hugging Face repository on June 11, 2026, alongside the start of the API launch.

The name deserves some precision. This is not a general-purpose "Kimi 2.7" replacing every previous Kimi product. Its official name is Kimi K2.7 Code, and the suffix describes its priorities: completing multi-stage programming tasks, using tools and following instructions across very long contexts.

The release matters for another reason. Moonshot AI is publishing the model weights and code, offering a relatively inexpensive OpenAI-compatible API, and comparing the system directly with GPT-5.5 and Claude Opus 4.8. The results do not show an open model defeating every closed competitor. They do show a substantial improvement over Kimi K2.6 and a narrowing gap in practical coding-agent workloads.

Kimi K2.7 Code at a glance

  • architecture: Mixture-of-Experts,
  • one trillion total parameters and 32 billion activated per token,
  • context window: 262,144 tokens,
  • inputs: text, images and video,
  • reasoning mode is mandatory,
  • tool use, JSON Mode and automatic context caching,
  • open weights under a Modified MIT License,
  • API price: $0.95 per million uncached input tokens and $4 per million output tokens.

The central upgrade: finishing the whole task

K2.7 Code is built on Kimi K2.6, but Moonshot AI is not presenting it as a minor improvement to code completion. The central objective is stronger end-to-end task completion: taking a software task from initial investigation to a working, verified result.

A coding agent working at that level cannot stop after generating one function. It needs to:

  1. find the relevant files and dependencies,
  2. understand the existing architecture,
  3. plan a change across multiple modules,
  4. use the terminal, tests and other tools,
  5. recognize when an approach has failed,
  6. revise the implementation without losing the original objective.

Moonshot highlights stronger generalization across Rust, Go and Python, as well as frontend development, DevOps, performance optimization and machine learning. That is a broader target than short code-completion benchmarks.

Efficiency is the second meaningful change. According to the model card, K2.7 uses approximately 30% fewer reasoning tokens than K2.6. Less reasoning does not necessarily mean a shallower answer. In long agent runs, efficiency often comes from avoiding repeated analysis, unnecessary reconsideration and extended deliberation over straightforward decisions.

Kimi K2.7 Code benchmarks against GPT-5.5 and Claude Opus 4.8

Moonshot AI published six evaluations covering coding, persistent agent work and tool use. The table below reproduces the figures from the official model card.

BenchmarkKimi K2.6Kimi K2.7 CodeGPT-5.5Claude Opus 4.8
Kimi Code Bench v250.962.069.067.4
Program Bench48.353.669.163.8
MLS Bench Lite26.735.135.542.8
Kimi Claw 24/7 Bench42.946.952.850.4
MCP Atlas69.476.079.481.3
MCP Mark Verified72.881.192.976.4

The primary conclusion is not that Kimi won. GPT-5.5 records the highest score in four of the six rows, while Claude Opus 4.8 leads MLS Bench Lite and MCP Atlas. Kimi K2.7 Code beats Opus on MCP Mark Verified but remains behind GPT-5.5 there.

What matters more is the consistency of the improvement over K2.6:

  • Kimi Code Bench v2 rises from 50.9 to 62.0,
  • Program Bench increases from 48.3 to 53.6,
  • MLS Bench Lite moves from 26.7 to 35.1,
  • MCP Mark Verified improves from 72.8 to 81.1.

The MLS Bench Lite gain is particularly interesting because it evaluates the ability to develop scalable machine-learning methods rather than merely edit web applications. K2.7 nearly matches GPT-5.5 on this test, although Claude Opus 4.8 remains ahead.

Reading the benchmark table without the marketing shortcut

The table comes from the model vendor and needs context. Kimi Code Bench v2 and Kimi Claw 24/7 Bench are internal Moonshot evaluations. The former covers realistic software tasks involving production incidents, infrastructure, security, frontend development and open-source projects. The latter measures persistent agent work across 17 professional scenarios.

The external benchmarks are not a perfectly controlled model-only comparison either. Kimi ran through Kimi Code CLI, GPT-5.5 through Codex and Opus 4.8 through Claude Code. The systems received broadly similar budgets, but each environment has its own tools, system prompts and context-management strategy. The results therefore measure a combination of model and agent harness.

That is useful when the question is, "Which system can finish real work?" It becomes less decisive when selecting a model for one particular repository. A private regression suite built from a team's actual issues, tests and review criteria remains more valuable than a general leaderboard position.

Architecture: one trillion parameters, with only a fraction active

Kimi K2.7 Code uses a Mixture-of-Experts architecture. It contains roughly one trillion parameters in total but activates 32 billion for each token. The model has 384 routed experts, selects eight experts per token and includes one shared expert.

This design increases total capacity without running the entire trillion-parameter network at every step. It does not make K2.7 a lightweight laptop model. Even with open weights, the infrastructure requirements are far beyond common 7B, 32B or 70B local models.

Moonshot publishes native INT4 quantization and supports deployment through vLLM, SGLang and KTransformers. Quantization reduces memory and inference costs, but self-hosting K2.7 remains a multi-accelerator server project for infrastructure teams and model providers.

A 256K context window and reasoning that cannot be disabled

The model supports a 262,144-token context window. That is enough room for substantial parts of a repository, tool history, documentation and intermediate results from a long task. The limit itself does not guarantee good context selection. Long-context agents still benefit from choosing relevant files and compressing older results.

K2.7 Code operates exclusively in reasoning mode. The API returns an error if a client attempts to disable the thinking parameter. It also forces preserve thinking, retaining reasoning content between turns and tool calls.

This creates an important integration requirement: multi-step tool loops must keep the assistant's reasoning_content field in the conversation history. Removing it can break the next request. K2.7 is therefore not always a one-line drop-in replacement in an existing OpenAI-compatible client.

Several generation parameters are fixed:

ParameterKimi K2.7 Code value
temperature1.0
top_p0.95
default max_tokens32,768
thinkingalways enabled
tool_choiceauto or none

Vision and video are part of the coding loop

K2.7 Code includes a 400-million-parameter MoonViT vision encoder. The official API accepts images and video, allowing an agent to inspect interface screenshots, diagrams, visual logs and the rendered result of an application.

The most useful workflow is not simply asking what appears in an image. The model can:

  • receive a design reference,
  • build the interface,
  • launch the application,
  • inspect the rendered result,
  • compare it with the target,
  • revise the code.

This closes the loop between writing code and evaluating its visual output. The official API and self-hosted model do not currently offer exactly the same feature set, however. Video support is described as experimental and is currently limited to Moonshot's official API rather than standard vLLM or SGLang deployments.

API pricing: inexpensive input, but agent reasoning still adds up

Official Kimi K2.7 Code pricing is:

Token typePrice per 1M
Cached input$0.19
Uncached input$0.95
Output$4.00

Automatic caching matters in coding agents because later turns often reuse a large prefix containing repository context, instructions and previous tool results. A cache hit reduces the input rate by a factor of five.

The low input price should not obscure the cost of output and reasoning tokens. A long agent task can execute dozens of steps, call tools and produce a substantial reasoning trace. Teams should compare the cost of a completed task rather than only the headline price per million tokens.

The license is permissive, but it is not standard MIT

The K2.7 Code repository and model weights are released under a Modified MIT License. It permits use, modification, publication, distribution, sublicensing and sale, provided that the copyright notice and license text are retained.

Moonshot adds one significant condition. A commercial product using the model must prominently display "Kimi K2.7 Code" in its interface if it exceeds either 100 million monthly active users or $20 million in monthly revenue.

That threshold will not affect most projects. It is still more precise to describe K2.7 as an open-weight model under a modified MIT license rather than assume that its terms are identical to standard MIT.

How developers can use Kimi K2.7 Code

The simplest route is Moonshot's API, which uses a format compatible with the OpenAI SDK. The model identifier is:

kimi-k2.7-code

Moonshot also documents Anthropic-compatible access and configuration for Claude Code, Cline and Roo Code. The vendor recommends its own Kimi Code CLI, which is the agent harness used for part of the published evaluation.

Open weights make private deployment possible, but the decision should account for:

  • accelerator and server costs,
  • inference-engine maintenance,
  • code-execution security,
  • terminal and data isolation,
  • agent-action logging,
  • model and dependency updates.

Running the weights privately does not automatically create a safe coding agent. The model still needs constrained permissions, sandboxing and controls around commands and external systems. Our secure MCP server guide covers the same principles for agents connected to tools and data.

Is Kimi K2.7 Code a competitor to Claude Code and Codex?

Yes, but primarily as a cost and deployment alternative, not an unconditional benchmark winner.

K2.7 has several compelling advantages:

  • open weights,
  • inexpensive API access,
  • a large context window,
  • multimodal input,
  • clear gains on long-running tasks,
  • the option to deploy outside the vendor's platform.

GPT-5.5 and Claude Opus 4.8 still perform better on most of the published tests. Closed products may also offer more mature agent environments, permission management and cloud integrations.

For a team pursuing the highest completion rate regardless of price, Kimi may not be the first choice. For an organization processing a high volume of tasks, needing deployment control or building a custom coding agent, the pricing difference and weight availability may matter more than a few leaderboard points.

What Kimi K2.7 Code actually changes

The most interesting part of this release is not one benchmark result. Kimi K2.7 Code shows that open coding models are becoming more than inexpensive code generators. They are beginning to compete on long-running work, tool use, context retention and autonomous task completion.

The model does not defeat GPT-5.5 or Opus 4.8 across the board. It is substantially better than K2.6, nearly matches GPT-5.5 on MLS Bench Lite and beats Opus in one MCP evaluation. It combines those gains with open weights, multimodal input and API pricing that makes high-volume use plausible.

The fairest verdict is therefore: Kimi K2.7 Code is not the new king of coding, but it is one of the most interesting open models for teams building their own software-engineering agents. Its real value will be measured not by a short prompt, but by how many actual tasks it completes in a specific repository and at what total cost.

(Photo: Mohammad Rahmani / Unsplash, license.)

Share: