DeepSeek V4 Makes Cheap Million-Token AI Feel Real
DeepSeek V4 arrives with open weights, Pro and Flash variants, a one-million-token context window and aggressive API pricing. The real story is not just speed, but whether cheaper long-context agents change how teams build AI products.

DeepSeek is back in the center of the AI conversation. The Chinese lab has released DeepSeek V4 Preview, an open-weight model family built around a promise that sounds simple but could reshape plenty of product roadmaps: strong reasoning, long context and low API prices at the same time.
This is not just another model card with bigger numbers. V4 lands in two variants, DeepSeek-V4-Pro and DeepSeek-V4-Flash, both with a one-million-token context window. That means the model can work with unusually large prompts: long contracts, full research folders, multi-file codebases or months of support conversations without chopping the task into tiny pieces.
The question is whether DeepSeek can turn that technical jump into something developers actually trust in production. The answer, at least for now, is promising, though not without complications.
What DeepSeek actually launched
DeepSeek V4 is currently described as a preview release, but the rollout is already broad enough to matter. The models are available through DeepSeek's chat product and API, and as open weights on Hugging Face.
The family has two main public versions:
- DeepSeek-V4-Pro: the larger model, with 1.6 trillion total parameters and 49 billion active parameters per token.
- DeepSeek-V4-Flash: the faster and cheaper option, with 284 billion total parameters and 13 billion active parameters.
Both are Mixture-of-Experts models, which means only part of the network is activated for a given token. That is how DeepSeek can advertise enormous total parameter counts without paying the full compute cost on every request.
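The routing idea behind Mixture-of-Experts can be sketched in a few lines. This is a toy illustration with random weights and top-2 gating; nothing here reflects DeepSeek's actual architecture. The point is only that a gate selects a small subset of experts per token, so most of the network's parameters sit idle on any given request:

```python
import numpy as np

def moe_forward(x, experts, gate_weights, top_k=2):
    """Route one token through only the top-k experts (toy illustration)."""
    logits = gate_weights @ x                 # one gate score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the chosen experts
    weights = np.exp(logits[top])
    probs = weights / weights.sum()           # softmax over the chosen experts
    # Only the selected experts run; the others contribute no compute.
    return sum(p * experts[i](x) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
dim, n_experts = 8, 16
# Each "expert" is just a random linear map in this sketch.
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(dim, dim)))
           for _ in range(n_experts)]
gate_weights = rng.normal(size=(n_experts, dim))

out = moe_forward(rng.normal(size=dim), experts, gate_weights)
print(out.shape)  # (8,)
```

With 16 experts and top-2 routing, only 2/16 of the expert parameters are touched per token, which is the same logic that lets a 1.6-trillion-parameter model activate only 49 billion per token.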
For users, the distinction is straightforward. Pro is the model for harder coding, reasoning, analysis and agent workflows. Flash is the everyday option for cheaper chat, extraction, summaries and lighter automation.
DeepSeek is also retiring older API labels. The company says deepseek-chat and deepseek-reasoner will be fully inaccessible after July 24, 2026, with compatibility currently routing those names to V4-Flash modes.
The million-token context is the headline feature
A one-million-token context window is the feature most teams will notice first. In practical terms, it lets a model see far more material at once before answering.
That matters because many AI workflows fail not because the model cannot reason, but because it cannot see enough. A support bot loses earlier conversation history. A coding assistant forgets the shape of the repository. A research agent keeps asking for documents it already read. A legal or compliance workflow needs to compare clauses across a stack of files.
DeepSeek says V4 uses a new attention design to make long context cheaper. Its model card describes a hybrid attention architecture combining compressed and sparse attention techniques. In the one-million-token setting, DeepSeek claims V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache required by DeepSeek-V3.2.
Those are the kinds of numbers that matter if you are building long-context tools at scale. A big window is nice in demos. A big window that does not crush memory and latency is what makes the feature commercially useful.
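To see why the KV-cache figure matters, a rough back-of-envelope helps. The layer count and head sizes below are illustrative placeholders, not DeepSeek's real architecture; the takeaway is only that a dense fp16 cache at one million tokens runs into hundreds of gigabytes, so shrinking it to 10% is the difference between feasible and not:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    # Keys and values (factor 2), stored per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

# Hypothetical dense baseline: 60 layers, 8 KV heads of dim 128, fp16 values.
full = kv_cache_bytes(60, 8, 128, 1_000_000)
print(f"full cache at 1M tokens: {full / 2**30:.1f} GiB")
print(f"at 10% of that:          {full * 0.10 / 2**30:.1f} GiB")
```

Under these made-up but plausible dimensions, the full cache is a few hundred gibibytes per sequence, which explains why long-context serving is usually memory-bound rather than compute-bound.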
Pricing is the other shock
DeepSeek is pairing the long context with aggressive API prices.
The public pricing page lists V4-Flash at $0.14 per million uncached input tokens and $0.28 per million output tokens. V4-Pro is listed at $1.74 per million uncached input tokens and $3.48 per million output tokens, with a temporary 75% discount running until the end of May 2026.
Cached input is cheaper still. That can make a real difference for products that repeatedly send the same large base context, such as a codebase, policy library or documentation corpus.
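Using the listed rates, the per-request arithmetic is easy to sketch. The token counts below are made up, and cached-input rates are left out because the exact figures are not quoted here; the per-million prices are the uncached ones listed above:

```python
# USD per million tokens, from the quoted pricing page (uncached input).
PRICES = {
    "V4-Flash": {"in": 0.14, "out": 0.28},
    "V4-Pro":   {"in": 1.74, "out": 3.48},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD for one request at the listed uncached rates."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# Example long-context request: 500k input tokens, 4k output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 500_000, 4_000):.4f}")
```

At these rates, a half-million-token request costs on the order of seven cents on Flash and under a dollar on Pro, which is why repeated large base contexts make prompt caching so attractive.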
The point is not that DeepSeek is the cheapest model for every job. It will not be. The point is that it is forcing buyers to ask a sharper question: if a model is close enough for the task and dramatically cheaper to run, how much premium are teams willing to pay for a closed frontier model?
That question is uncomfortable for every lab selling high-margin inference.
How strong is V4?
DeepSeek's own benchmarks frame V4-Pro as the strongest open model in several areas, especially coding, math, STEM reasoning, long context and agentic tasks. The company says it trails only the very top closed models in some knowledge tests and competes closely with the newest frontier systems.
The model card is also candid in places. It presents V4 as an important bridge toward the frontier, not a clean win over every closed competitor. That nuance matters.
An external evaluation from NIST's Center for AI Standards and Innovation (CAISI) adds a useful counterweight. CAISI describes DeepSeek V4 Pro as the most capable Chinese model it has evaluated so far, but says its aggregate capability appears to lag the leading U.S. frontier by roughly eight months.
CAISI also found a gap between DeepSeek's self-reported benchmark picture and results on held-out or non-public evaluations. In particular, V4 looked weaker on some abstract reasoning, software engineering and cyber tasks than the launch narrative might suggest.
That does not make V4 unimpressive. It makes the story more realistic. DeepSeek has built a very strong open-weight model with unusual cost efficiency. It has not erased the frontier.
Why developers are paying attention
V4 is aimed squarely at developers building agent workflows. DeepSeek says the model has dedicated optimizations for agentic capability and integrates with tools such as Claude Code, OpenClaw and OpenCode.
That is not a random feature bullet. Coding agents are becoming one of the most important battlegrounds in AI because they combine long context, tool use, planning, retrieval and verification. They also burn tokens quickly.
A cheaper model with a long context window can change the economics of these tools. Instead of asking the model to inspect only a few files, a product can afford to give it much more of the repository. Instead of forcing short agent loops, teams can let the model plan, execute, inspect results and retry with more room.
This is where V4-Flash could become especially interesting. If it is good enough for routine agent steps while Pro handles the hardest reasoning, developers can build mixed-model systems that use costlier intelligence only where it is needed.
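A mixed-model setup can be as simple as a routing function in front of the API. The model identifiers and the difficulty heuristic below are hypothetical placeholders, not documented DeepSeek API names; real routers usually classify steps with richer signals than a keyword set:

```python
def pick_model(step):
    """Toy router: send routine agent steps to Flash, hard ones to Pro.

    Both the step taxonomy and the model ids are made-up examples.
    """
    hard_kinds = {"plan", "debug", "prove", "refactor"}
    return "deepseek-v4-pro" if step["kind"] in hard_kinds else "deepseek-v4-flash"

steps = [
    {"kind": "extract"},
    {"kind": "summarize"},
    {"kind": "plan"},
    {"kind": "debug"},
]
print([pick_model(s) for s in steps])
```

The design choice here is to make escalation explicit and auditable: every call records why it went to the expensive model, which keeps the cost profile predictable as agent loops grow.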
The Huawei and chip angle
DeepSeek V4 is also a hardware story. Huawei has said its Ascend systems support the model, and several reports frame V4 as a test of whether Chinese AI labs can reduce dependence on Nvidia.
That does not mean DeepSeek has fully moved beyond Nvidia. MIT Technology Review notes that Chinese chips appear better suited for inference than frontier training today, and that V4 may still rely on Nvidia for important parts of the training process. The more grounded reading is that V4 is another step toward a parallel AI stack: Chinese models, Chinese chips, Chinese infrastructure and open-weight distribution.
For developers outside China, the hardware politics may feel distant, but they still matter because they shape price, availability and competition. If DeepSeek can run strong models efficiently on non-Nvidia infrastructure, it adds pressure to the entire AI supply chain.
What to watch before adopting it
V4 is exciting, but it should not be dropped into sensitive workflows casually.
Teams should evaluate it on their own tasks, not only on benchmark tables. Long-context performance can vary wildly depending on retrieval style, prompt format and whether the answer depends on small details buried deep in the input.
Data governance also matters. Using DeepSeek's hosted API is a different risk decision than running open weights on controlled infrastructure. Companies handling regulated data, source code or customer information need a clear policy before experimenting.
Cost also needs testing. Cheap per-token prices do not automatically mean cheap tasks. A one-million-token window makes it easy to send enormous prompts. Without budgets, caching and observability, teams can simply spend less per token while sending far more tokens.
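One defensive pattern is a hard per-task spend cap checked before each call goes out. This is a minimal sketch; the cap, prices and accounting granularity are all illustrative:

```python
class TokenBudget:
    """Per-task spend guard: refuse calls that would blow the budget."""

    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, input_tokens, output_tokens, in_price, out_price):
        """Record a call's cost (USD per million tokens) or raise."""
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        if self.spent + cost > self.max_usd:
            raise RuntimeError("task budget exceeded; refusing to send")
        self.spent += cost
        return cost

budget = TokenBudget(max_usd=0.50)
# A Flash-sized long-context call at the listed uncached rates.
budget.charge(500_000, 4_000, in_price=0.14, out_price=0.28)
print(f"spent so far: ${budget.spent:.4f}")
```

Pairing a guard like this with prompt caching and per-call logging turns "cheap per token" into something closer to "cheap per task."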
The bigger takeaway
DeepSeek V4 probably will not shock markets the way R1 did in early 2025. The industry is less surprised now, and rival labs have moved fast.
Still, V4 may be the more practical release. It pushes open-weight models toward workflows that matter commercially: huge context, coding agents, low-cost inference and flexible deployment. Even if it sits behind the absolute frontier, it gives developers another serious option in the zone where products are actually built.
That is the part to watch. A model does not have to top every benchmark to change the market. Sometimes it only needs to be good enough, open enough and cheap enough that teams start designing around it.
Sources: DeepSeek, DeepSeek pricing, Hugging Face, NIST CAISI, MIT Technology Review, Fortune, Euronews. Photo: TreffikAI generated cover.


